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COINCIDENCE DETECTION METHOD, 
PRODUCTS AND APPARATUS 

This is a continuation of International Application PCT/ 
CA98/00273, with an international filing date of Mar. 23, 
1998, now abandoned, which claims benefits of the filing 
date of U^. Provisional application 60/041472, filed Mar. 
24,1997. 

TBCHNICALFLELD 

The invention relates to methods, devices and systems for 
coincidence detection among a multitude of variables. In 
addition, the invention relates to applying coincidence 
detection methods to various fields, and to products derived 
from sudi application. 

BACKGROUND ART 

k-tiq)les of Correlated Attributes 

The discovery of correlations among pairs of k-tuples of 
variables has applications in many areas of science, 
medicine, industry and commerce. For example, it is of great 
interest to physicians and public health professionals to 
know which lifestyle, dietary, and environmental factors 
correlate with each other and with particular diseases in a 
database of patient histories. It is potentially profitable for a 
trader in stocks or commodities to discover a set of financial 
instruments whose prices covary over time. Sales staff in a 
supermarket chain or mail-order distributor would be inter- 
ested in knowing that consumers who buy product A also 
tend to buy products B and C, and this can be discovered in 
a database of sales records. Computational molecular biolo- 
gists and drug discovery researchers would like to infer 
aspects of 3D molecular structure from correlations between 
distant sequence elements in aligned sets of RNAor protein 
sequences. 

One formulation of the general problem which encom- 
passes many diverse applications, and which facilitates 
understanding of the principles described herein is a matrix 
of discrete features in which rows correspond to "objects" 
(such as individual patients, stock prices, consumers, or 
protein sequences) and the columns correspond to features, 
or attributes, or variables (such as lifestyle factors, stocks, 
sales items, or amino acid residue positions). 

Mathematical methods for determining a measure of the 
type, degree, and statistical significance of correlation 
between any two, or even three or four, particular variables 
are widespread and well-understood. These methods include 
linear and nonlinear regression for continuous variables and 
contingency table analysis techniques for discrete variables. 
However, great difficulties arise when one tries to estimate 
correlation — or just estimate joint or conditional 
probabilities — over much larger sets of variables. This 
intractability has one main cause — there are* too many joint 
attribute-value probability density terms — and this mani- 
fests itself in two serious problems: (1) computing and 
storing fiequency counts over all terms, over the database, 
requires too much computation and memory; (2) there i& 
usually an insufiScient number of database records to support 
reliable probability estimates based on those frequency 
counts. 

Let us consider some details. For M records (objects), N 
variables (attributes, fields), and supposing that each vari- 
able has the same set of |A| possible values, there are 



{N-k)lk\ 

^ k-tuples of columns. Adding the number of k-tuples for each 
k«l, 2, . . . , N^ results in 2^-1 such tuples of all sizes. This 
exponential complexity has been a major obstacle standing 
in the way of higher-order probability estimation and cor- 
relation detection methodologies. 

One natural way to think about this complexity is in terms 
of the power set of the set of column variables. This power 
set forms a mathematical lattice under the operation <=■ , a 
"tower" corresponding to a graph whose nodes are subsets 
of this set of column variables. ^Note that if a set has N 

15 members, the power set has 2^ members). From this 
viewpoint, two nodes representing subsets and are 
connected if and only if either a^ c Qj or Oj*^ a^. We say 
that o^s node is above o^'s if cr Qj. This gives a natural 
meaning to the term "higher-order", as appearing higher vp 

20 the tower. We call the bottom, the null set node, the Olh tier; 
the single column terms from the first tier, and so on. 

Continuing with the tower analogy, we note that each 
"floor" of this edifice contains 

25 

(A) 

"suites", and each suite contains [A]* "rooms". In other 
words, the kth level of the lattice corresponds to 

30 

N 

(A) 

different k-tuples of column variables, and associated with 
each k-tuple is an (|A| by |A| . . . by |A|y) contingency table, 
each cell of which must store the counted frequency of a 
particular joint symbol (a,-i, a^, . . . , a,;^ were one to use a 
classical contingency table test for the correlation between 
those particular k columns. (See FIG. la). 
^ For any ke{l, 2, . . . , N}, for any particular k-tuple of 
cohmins (Cy^, Cj^, » - * , Cjj^, there are |A|* possible joint 
values. For any k£{l, 2, . . . , N}, for any particular k-tuple 
of columns (c^^, c^^, . . . , Cjj^, the estimation of KuUback 
divergence or other correlation function using the dauset is 
at least an Q(Mk) or Q(|A|*) computation, depending upon 
the relative sizes of M, k and |A|. 

A comprehensive probabilistic model of the database 
must be able to specify probability estimates for 




55 terms. This means, for example in the computational 
molecular biology domain, that for a tiny heptapeptide 
sequence family, each sequence having a length of seven 
amino acid residues, there are 1,801,088,540 terms to 
specify. For an unreaUstically small RNA of fifteen nucle- 

60 otides in length, over the smaller RNA alphabet of four base 
symbols, there arc 30,517,578,124 terms. 

Clearly the models can become intractably huge. What 
about the space of possible models through which a 
modelling/learning procedure must search? Consider a 

65 latent-variable model, which seeks to explain correlations 
between sets of observable variables by positing latent 
variables whose states influence the observables jointly. 
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Since each model must specify a set of k-tuples of variables, . proposes a hashing method for calculating these special 

and there are 6xp(2, 2^ (i.e., 2 to the power 2^ such sets, terms. Each kth-order potential requires an estimation of a 

there are 6Xp(2, 2^ possible models in the worst-case search kth- ordcr joint probability densitv as well as some num ber 

space. of lower-order (typically k-/th- order) densi ties. Yhc asym p- 

Various methods for determining a measure of higher- 5 lotic time complexity ot Miller's pattcrti-collecli on 

order probabilities will circumvent the combinatorM explo- subroutine, the major componeni oi ihe pote"g!i al 

sion through severe prior restrictions on the width k (See calculation/is, when intcrpretea in nur terminniof^y! 
FIG. 3fl), the locality (FIG. 2a), the number, or the degrees ' 

of correlation of the higher-order features sought, and on the k 

kinds of models entertained (See FIG. 4a). m • V* « OiMN*^) 

Three Goals of Probability Estimation ^ t*> 

It is useful, before discussing details of existing methods 
and of the cunent invention, to delineate diree different 

possible goals of probability estimation in large datasets, where K-k,, is the highest order of features for which o ne 

each corresponding to a large body of research and current will search and bv which one will repre sent database oEi ccts. 

practice: This eicponential blow-up prevents one from searching for 

1. Estimation of the fully-specified, fiiUy higher-order badier-order features (HOFs) of anv order k much higher 
joint probability distribution: Estimate a probability than 4 or 5 m ^a l abas^ with hl i ndi^^ of attrih utes 
density q that specifies Manv methods, m different apphcaUon areas, simply limit 

^ ^ ^ k *^ k""^ F"^ '■vflTT plgr pairwise mter-rcsidue correla tion 

20 methnHs d iscnver second-order features diat can be usef ul in 

q(*n@c<i» "i2@C(2i • • • » a<*®c,4) t he prediction of protein structure and tuncti on and that can 

b c^ built into classltiers more sens itive tnai nrsi-order 

for all k-tuples of attributes and possible values. sequence classifiers and fold-iecognizers. To the extent that 

2. Hypothesis testing, for particular hypotheses concern- k-ary interactions are important, and to the extent that such 
ing particular attributes and particular variables: For 25 interactions leave traces in sets of homologous sequences, 
example, are the data consistent with the hypothesis the pairwise methods are defident One can try to infer k-ary 
that columns c^, c,-^, . . . , c^^ are independent? correlations from sets of 2-ary correlations [9] (essentially 

3. Feature detection, or "data mining": Detect the most by computing the transitive closure of the "CorrelatesWith" 
suspicious coincidences, for example, joint attribute binary relation), but this heuristic can lead to trouble: high 
occurrences that are more probable than would be 30 paiiwise correlations among variables x, y, z do not in 
predicted firom lower-order marginals. Related to this, general imply, nor are they necessarily implied by, a high 
find the most highly correlated k-tuples of columns. 3-ary correlation (as measured by Kullback divergence) of 

It is the feature detection and data mining applications the three variables x, y, z. In other application areas, such as 
that are most relevant to the present invention. However, the study of multiple drug interactions, it is similarly true 
some of the most successful ways to estimate a full higher- 35 that important higher-order relationships can be missed by 
order joint probability distribution of a database require the pairwise correlation detection methods, 
specification of exactly those higher-order terms which The Paturi et al. Method for Identifying the Most Corre- 
represent high correlations among sets of k^2 variables and lation Pair of Random Variables. A method has been 
invoking maximum entropy assumptions, and therefore the reported for the problem of finding the most highly corre- 
current invention is aimed at those applications as well. 40 lated pair X^., Xy of variables from among a large set of N 
Rel ated Work random binary variables X^, X2, . . . , X^y^. The method is 
p^-'Vmoiis mathematical and computational methods have easily extended to finding the most correlated k-tuple of 
been proposed and used to estimate higher-order random binary variables, but at a significant increase in 
probabilities, to detect correlations, and to model higher- computational complexity, and only for k^2 fixed a priori, 
order database relationships. All such prior methods either 45 It uses a definition of correlation that has Correlation (X,, 
perform a global, sometimes exhaustive search through aU Xy)=PpC,«X^-] over some sets of M samples {X^j, X"^, • • • , 
possible k-tuples of variables, which is too costly, or they X'^jv^lm-iA .... a/- (^^^^ ^PQ"^] means "the probability 
avoid the complexity altogether by limiting their search to that variable X, has the same value, or state, as variable Xy). 
only k-tuples of a specific fixed, small size k. (Often, kB2 so Much of the computational complexity, both time complex- 
only pairwise correlations are ever considered). so ity and sample complexity, of their method can be incurred 

Below are listed some representative examples of related in trying to separate two or more nearly equally-correlated 

work. pairs (or k-tiq)les) of variables. 

Assuming Independence between Attributes. The easiest The two variants of the Paturi method are asymptotically 

way to avoid the complexity of higher-order correlations is quadratic and subquadratic in N, respectively, the faster 

justtopretendthattheydonotexist. Many of the algorithms 55 procedure requiring more sampling. When the method is 

and computer programs, historically dominant m some fields extended to search for the biggest k-ary correlation, where 

of application of the current method, simply construct and correlation is now defined as P[X^ioXqo . . . the time 

use a model of the data in which all variables, all attributes, complexity grows to approximately 0(k^N*log^N). Search 

are independent. For example, the modelling of DNA and for highly correlated attribute cliques of width k much 

protein sequences, in computational molecular biology, is eo greater than 5 or 6 in very large datasets is once again ruled 

often done with consensus sequences and profiles, which out. 

assume incorrectly that the different base or amino add Hidden Markov Models. Hidden Markov Models 

residue positions are independent Reliance on such models (HMMs) have been used widely and with increasing success 

can obscure CTUcial functional and structural insights into the in recent years, in both automatic speech recognition and in 

DNA or proteins being modelled. 65 the modelling of protein, DNA, and RNA sequences. 

Prior Limits on k. One proposal for Gibbs models of Although some groups have reported significant success 

database^ is based on the use of Gibbs potentials, and it in modelling protein sequence families and continuous 
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speech data with HMMs, noDetheless theic aie great 
improvements to be made in learning time and model 
robustness by the ''hardwiring^ of pre-selected higher-order 
features into HMMs. (This has been investigated for HMM- 
like recurrent neural netwoiks, in different domains). 

Some of the same reasons why HMMs are very good at 
aligning the protein sequences or recorded utterances in the 
first place, using local sequential correlations, make such 
methods less useful for finding the important sequence - 
distant correlations in data that has already been partially or 
completely aligned. The phenomenon responsible for this 
dilemma is termed ''diffusion". 

A first-order HMM, by definition, assumes independence 
among sequence columns, given a hidden state sequence. 
Multiple alternative state sequences can in principle be used 
to capture longer-range interactions, but the number of these 
grows exponentially with the number of k-tuples of corre- 
lated columns. 

The Agrawal et al. Method for Discovery of Association 
Rules. This method was developed in perhaps the purest data 
mining context, the automatic extraction of knowledge-base 
rules firom databases. It considers a database of M transac- 
tions (objects, rows) and N items (attributes, columns) and 
seeks to extract rules of the form aa>b. It therefore seek 
pairs of attributes a, b such that "transactions that contain a 
tend to contain b", hence those pairs with high values for 
p(b|a). "People who buy CD players tend to buy CDs", is just 
one example suggesting the potential commercial interests 
in such methods. (More generally, one can search for sets of 
attributes with high p(bi, b^, . . . , bjaj, a^, . . - , a^)). 

A rule aB>b is said to have: 

1. confidence c if c % of transactions containing a also 
contain b (hence, roughly, if 

{ p(a) J* (100)^' 

2. isupport s if s % of transactions contain a and b (hence, 
roughly, if 

The goals behind this method are different from the 
objectives of the current invention. However, the different 
objectives are brought closer together if one focuses on the 
Agrawal method's discovery of symmetric rules (so that the 
search is for attribute pairs displaying high values for both 

pia,b) ^ p{a.b) 
_a„d— ). 

and if one reduces the emphasis on support (so that coinci- 
dences that are suspicious, even if occurring rarely, are 
sought). 

The Agrawal method is shown to have 0(||S||-MN) time 
complexity, where ||S|| is the sum of all values Support (a) 
for an expoiientially larg? number of k-tuples a of attributes, 
of any size l^k^N, that reach a particular stage of pro- 
cessing in this procedure. Hence the method is 0(2^ in the 
woist case. A series of empirical tests are performed on what 
they considered to be realistic datasets for their domain. The 
running time of the procedure grew only linearly with the 
number M of transactions, but the number of items, or 
attributes, was held constant at N^olQOO, and their con- 
structed datasets probably contained no correlated k-tuples 
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of width k>10. An analysis of their algorithm, which is based 
on an incremental build-up of kth-order diqucs from k-/th- 
order cliques, makes clear that the method tikes much more 
computation to find wide HOFs (large k) than narrower 
5 HOFs (lower k) of equivalent statistical significance. 

Steeg, Robinson, Deerfield, Lappa — 1993. Some rough, 
heuristic methods have been presented for finding k-tuplcs 
of correlated residues (positions) in sets of aligned protein 
sequences. One of the presented methods employed one 
embodiment of a mdimentary version of the representation 
and detecting coincidences steps of the described herein. 

Alternative methods of, and devices for, finding correla- 
tions between attributes, and applications for those 
correlations, are required. 

15 DISCLOSURE OF THE INVENTION 

In a first aspect the present invention provides a coinci- 
dence detection method for use with a data set of objects 
having a number of attributes. The base method includes the 
following steps: 
^ representing a set of M objects in tenns of a number N^ 
of variables ("attributes"), where an attribute is said to 
occur in an object if the object possesses the attribute; 
sampling a subset of r,- out of the M objects, for each 
iteration among a predetermined number of iterations; 
^ detecting and recording coincidences among sets of k of 
the attributes in each sampled subset of objects, a 
coincidence being the co-occurrence of l^k^N^ 
attributes in the same h,- out of r^ objects in the sampled 
subset, where 0^h,-ir^ 
^ deteraiining an expected count of coincidences for any set 
of k attributes and a predetermined number of iterations 
of sampling and coincidence-counting as described 
above, the determining being performed before sam- 
pling and collecting, at the same time or after sampling 
and collecting; 
comparing, for any set of k attributes and number of 
iterations of sampling and coincidence-counting, the 
observed count versus the expected count of 
^ coincidences, and from this comparison determining a 
measure of conelation (or association, or dependence) 
for the set of k attributes; and 
reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a set of k of the N^ 
45 attributes which have been determined by this process 
to have a value for a chosen correlation measure above 
a predetermined threshold value. 
In a second aspect the invention provides a coincidence 
detection method for use with a data set of objects having a 
5Q number of attributes, the method comprising the steps of: 
sampling a subset of the data set for a predetermined 
number of iterations, each iteration the sampled subset 
of the data set having for each object the same subset 
of attributes; 

55 detecting, and recording counts of, coincidences in each 
sampled subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 

60 occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, 
detecting and recording counts of coincidences in other 
subsets; 

65 determining an expected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording;* 
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comparing, for each coincideace of interest, the observed 
count of coincidences versus the expected count of 
coincidences, and from this comparison detennining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 
reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above 
a respective pre-determined threshold. 
In any of its aspects the comparison of observed and 
expected coimts may be calculated using a Chemoff bound 
on tail probabilities, and counts may be recorded by storing 
a running total of the count of each coinddenoe over all of 
the sampled subsets. 

In a third aspect the invention provides a method for 
visual exploration of a data set of objects having a number 
of attributes, the method comprising the steps of: 
sampling a subset of the data set for a predetermmed 
number of iterations, each iteration the sampled subset 
of the data set having the same number of objects ^ 
although not necessarily the same objects and having 
for each object the same subset of attributes; 
detecting, and recording counts of, coincidences in each 
sampled subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
cidences in ea(± sampled subset of the data set being 
performed before, at the same time or after sampling, 
detecting and recording counts of coincidences in other 
subsets; 

detennining an expected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording; 

comparing, for each coincidence of interest, the observed 
count of coincidences versus the expected count of 
coincidences, and from this comparison determining a 
measure of oonelation for the plurality of attributes for 
the coinddenoe; and 

reporting a set of k-tuples of correlated attributes to a user 
through a graphical interface, where a k-tuple of cor- 
related attributes is a plurality of attributes for which 
the measure of conelation is above a respective pre- 
determined threshold. 

In a fourth aspect the invention provides a pre-processing 
method for use with a data modelling unit to capture and 
report to the data modelling unit higher order interactions of 
a data set of objects having a number of attributes, the 
method comprising the steps of: 

sampling a subset of the data set for a predetermined 
number of iterations, each iteration the sampled subset 
of the data set having for each object the same subset 
of attributes; 

detecting, and recording counts of, coinddences in each 
sampled subset of the data set, a ooinddcnce being the 
co-occuncnce of a plurality of attribute values in one or 
more objects in a sampled subset, where the plurality of 
attribute values is the same for eadi occurrence, the 
detecting and recording counts of coinddences in eadi 
sampled subset being performed before, at the same 
time or after sampling, detecting and recording counts 
of coincidences in other subsets; 

detennining an expected coimt for each coinddence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording; 
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conq>aring, for each coincidence of interest, the observed 
count of coinddences versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 

reporting to the data modelling imit a set of k-tuples of 
correlated attributes, where a k-tuple of correlated 
attributes is a plurality of attributes for whidi the 
measure of correlation is above a respective pre- 
determined threshold. 

In a fifth aspect the invention provides a correlation 
elimination method for tise with a data set of objects having 
a number of attributes^ the method comprising the steps of: 

sampling a subset of the data set for a predetermined 
number of iterations^ eadi iteration the sampled subset 
of the data set having for each object the same subset 
of attributes; 

detecting, and recording counts of, coinddences in each 
sampled subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset being performed 
before, at the same time or after sampling, detecting 
and recording counts of coinddences in other subsets; 
determining an expected count for each coinddence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording; 
comparing, for each coincidence of interest, the observed 
count of coinddences versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 
eliminating a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above 
a respective pre -determined threshold. 
In any of the aspects, the objects may be sales 
transactions, each transaction comprising one or more pur- 
chased products, and the attributes may be instances of sale 
of particular products or types of products. The objects may 
be time slices and the attributes may be the status of 
elements in a system. The objects may be time slices and the 
attributes may be prices, or price changes o^ financial 
instruments or oonmiodities. 

In any of the a:^ects the steps of the method may be 
represented by the following pseudo-code: 

0. begin 

1. read (MATRIX); 

2. read (R, T); 

3. compute_first_order_jnarginals(MArRIX); 

4. csets :«{}; 

5. for iter«l to T do 

6. sampled_jows :Brsample(R, MATRIX); 

7. attributes :-get_attributes(sampled_rows); 

8. alL_coinddences :ofii>a_alL-Coinddences(attributes); 

9. for coincidence in all_coindd6nccs do 

10. if cset__already__exists(coinddenoe, csets) 

11. then update_cset(coincidence, csets); 

12. else add_^ew_cset(coinddence, csets); 

13. endif 

14. endfor 
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15. endfor 

16. for cset in csets do 

17. expected :=compiite_expected jiatch_count(cset); 

18. observed :-get_observed__match__count(cset); 

19. stats :-update_stats(cset, hypoth_test(expected. ^ 
observed)); 

20. endfor 

21. prmt_final_stats(csets, stats); 

22. end 

In a sixth aspect the invention provides a coincidence 
detection system for use with a data set of objects, each 
object having a plurality of attributes, the system compris- 
ing: 

means for sampling a subset of the data set for a prede- 
termined number of iterations, each iteration the 
sampled subset of the data set having for each object 
the same subset of attributes; 
means for detecting, and recording counts of, coinci- 
dences in each sampled subset of the data set, a 20 
coincidence being the co-occurrence of a plurality of 
attribute values in one or more objects in a sampled 
subset of the data set, where the plurality of attribute 
values is the same for each occurrence, the detecting 
and recording couints of coincidences in each sampled 25 
subset being performed before, at the same time or after 
sampling, detecting and recording counts of coinci- 
dences in other subsets; 
means for determining an expected count for each coin- 
cidence of interest, the detennining being performed 30 
before, at the same time, or after sampling, detecting 
and recording; 
means for comparing, for each coincidence of interest, the 
observed count of coincidences versus the expected 
count of coincidences, and from this comparison deter- 35 
mining a measure of correlation for the plurality of 
attributes for the coincidence; and 
means for reporting a set of k-tuples of correlated 
attributes, where a k-tuple of correlated attributes is a 
plurality of attributes for which the measure of corre- 40 
lation is above a respective pre-determined threshold. 
In the system of the sixth aspect, the means for sampling 
a subset of the data set may comprise means for dividing the 
data set into subsets for sampling. The means for detecting 
and recording cotmts of coincidences may comprise an array 45 
of processing nodes, each processing node detecting and 
recording a respective subcount of coincidences, and the 
means for comparing, for each coincidence of interest, said 
observed count of coincidences to said expected count of 
coincidences may comprise means for merging said sub- so 
counts to provide said observed count. At^least one of said 
processing nodes may comprise a respective subarray of 
processing nodes that detect and record respective subsub- 
coimts of coincidences, and said means for merging merges 
said subsubcounts to provide said subcounts and/or said 55 
observed count. Each proccssinf^ node may compri se 
memory includi ng an input buffer for storing receiv ed 
gbsets ot tne aaia set and an output bufferfor stogag the 
subcd uilt ui the aubau b oo uutrg &d a memoryj jus that trans- 
fers dai a 1 6 and Ituunne memo ry — 60 

in a sevenm aspect the invention provides coincidence 
detection programmed media for use with a computer and 
with a data set of objects having a number of attributes, the 
programmed media comprising: 

a computer program stored on storage media compatible 65 
with the computer, the computer program containing 
instmctions to direct the computer to: 
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sample a subset of the data set for a predetermined 
number of iterations, each iteration the sampled subset 
of the data set having for each object the same subset 

of attributes; 

detect and record counts of coincidences in each sampled 
subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset being performed 
before, at the same time or after sampling, detecting 
and recording counts of coinddenoes in other subsets; 

determine an expected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording; 

compare, for each coincidence of interest, the observed 
count of coincidences versus the expected count of 
coincidences, and from this comparison determine a 
measiu-e of correlation for the plurality of attributes for 
the coincidence; and 

report a set of k-tuples of correlated attributes, where a 
k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above 
a respective pre-determined threshold. 

In an eighth aspect the invention provides a coincidence 
detection system for use with a data set of objects having a 
number of attributes, the system comprising: 
a computer; and 

a computer program on media compatible with the 
conqiuter, the computer program directing the computer 
to: 

sample a subset of the data set for a predetermined 
number of iterations, each iteration the sampled subset 
having for each object the same subset of attributes, 

detect, and record counts of, coincidences in each 
sampled subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset being performed 
before, at the same time or after sampling, detecting 
and recording counts of coinddeoces in other subsets; 

determine an eicpected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording, 

compare, for each coincidence of interest, the observed 
count of coinddences versus the expected coimt of 
coincidences, and from this comparison determine a 
measure of correlation for the plurahty of attributes for 
the coincidence, and 

report a set of k-tuples of correlated attributes, where a 
k-tuple of correlated attributes is a plurality of 
attributes for whidi the measure of conelation is above 
a respective pre-determined threshold. 

In any of its aspects the methods of the invention may 
further comprise the step of 

representing the objects and attributes in a matrix of objects 

versus attributes prior to sampling the data set, the data set 

being sampled by sampling the matrix. 

In a ninth aspect the invention provides a product having 
a set of attributes selected by: • 

sampling a subset of a data set representing objects versus 
attributes for a predetermined number of iterations, 



09/21/2004, EAST Version: 1-4.1 



. us 6,493,637 Bl 

11 12 

each iteration the sampled subset having the same dues or portions of the motif . Itie ligand, by interacting with 

number of objects although not necessarily the same the motif, may interfere with function of a region of the 

objects and having for eadi object the same subset of protein comprising the motif. 

attributes ^ * thirteenth aspect the invention provides a diagnostic 

detecting, a^id recording counts of, coincidences in each 5 agent comprising a ligand that interacts with a pcote^ 

sampled subset of the data set, a coincidence being the ^^^g * slnictural moUf identified using the method of the 

co-ciccurrence of a plurality of attribute values in one or ^V^f mvenUon, and a detectable label hnked 

more objects in a sampled subset of the data set, where to the ligand. .... . 

the phKaHty of attribute values is the same for each 1° a fourteenth aspect the mvenUon provides a pharma- 

ocairrence, the detecting and recording counts of coin- lo ceuUtal compositton for mteracttng with an envelope pro- 

cidences in each sampled subset being performed «f human mimuDodeflciency vmis m "he envelope 

before, at the same time or after sampling, detecting P^lff "«=1"^S » • ''^'^ 

and recording counts of coincidences in other subsets, fP*»^ coordinates of residues A18/Q31/H33, ODmpnsing a 

. . ^ , » . . .J J. hcand mcludine at least one functional group that interacts 

determinmg an expected count for each comcidence of ^^^^ ^ phannaceutically acceptable carrier or 

mterest, the determmmg being performed before, at^to^ .^^^ ^^^^^^^^ may include at least one 

same time, or after samplmg, detectmg and recording f^^t^^^l group capable of binding to and being present in 

comparmg, for each comcidence of interest, die observed an effective position in said ligand to bind to residue 18, at 

count of comcidences versus the expected count of j^^^^ functional group capable of binding to and being 

coincidences, and from this comparison dcteraiimng a ^^^^^ ^ ^ effective position in said ligand to bind to 

measure of correlation for the plurality of attributes for ^^^^^^^ ^^^j functional group capable of 

the coincidence, and binding to and being present in an effective position in said 

reporting a set of k-Uiples of correlated attributes, where Ugaud to bind to residue 33. 

a k-tuple of correlated attributes is a plurality of ^ fifteenth aspect the invention provides a method of 

atUibutes for which the measure of correlation is above ^ designmg a Ugand to interact with a structural motif of an 

a respective pre -determined threshold. envelope protein of human immunodeficiency virus (HIV), 

In a tenth aspect the invention provides a product defined method comprising the steps of: providing a template 

by applying a set of rules generated from: having spatial coordinates of residues A18, Q31 and H33 in 

sampling a subset of a data set representing objects versus the V3 loop of HIV envelope protein, and computationally 

attributes for a predetermined number of iterations, ^ evolving a diemical ligand using an effective algorithm with 

each iteration the sampled subset having for each object spatial constraints, so that said evolved ligand includes at 

the same subset of attributes, least one effective functional group that binds to the motif. 

detecting and recording counts of coincidences in each The ligand may comprise at least one functional group 

sampled subset of the data set, a coincidence being the capable of binding to and being present in an effective 

co-occurrence of a plurality of attribute values in one or 35 position in said ligand to bind to residue 18, at least one 

more objects in a sampled subset of the data set, where fimctional group capable of binding to and being present in 

the plurality of attribute values is the same for eadi an effective position in said ligand to bind to residue 31, and 

. occunence, the detecting and recording counts of coin- at least one frinctional group capable of binding to and being 

cidences in each sampled subset being performed present in an effective position in said ligaiid to bind to 

before, at the same time or after sampling, detecting ^ residue 33. 

and recording counts of coincidences in other subsets, In a sixteenth aspect the invention provides a method of 

determining an expected count for each coincidence of identifying a ligand to bind with a structural motif of an 

interest, the determining being performed before, at the envelope protein of human immunodeficiency virus (HIV), 

same time, or after sampling, detecting and recording, the method comprising the steps of: providing a template 

comparing, for each coincidence of interest, the observed as having spatial coordinates of A18, Q31 and H33 in the V3 

count of coincidences versus the expected count of loop of HIV envelope protein; providing a data base con- 

coinddehces, and from this comparison determining a taining structure and orientation of molecides; and screening 

measure of correlation for the plurality of attributes for said molecules to determine if they contain effective moi- 

the coincidence, and eties spaced relative to each other so that the moieties 

reporting a set of k-tuples of correlated atu-ibutes, where so interact with the motif. A first moiety of the molecule may 

a k-tuple of conelated attributes is a plurality of interact with residue 18, a second moiety of the molecule 

attributes for which the measure of correlation is above interacts with residue 31 and a third moiety of the molecule 

a respective pie-determined threshold. interacts with residue 33. 

In any aspect the methods of the invention may further . In a seventeenth aspect of the invention the invention may 

comprise the step of applying rules that are defined by the 55 provide antigens and vaccines embodying the covarying 

reported correlated attributes. k-tuples described herein. 

In an eleventh aspect the invention provides a peptide or In an eighteenth aspect the invention provides a product 

peptidomimetic including a structural motif of the V3 loop being defined by its interaction with a set of attributes 

of HIV envelope protein including spatial coordinates of selected by: 

residues A18/Q31/H33. 60 sampling a subset of a data set representing objects versus 

In a twelfth aspect the invention provides a phaimaceu- attributes for a predetermined number of iterations, 

tical composition comprising a Ugand that interacts with a each iteration the sampled subset of the data set haying 

protein having a structural motif identified using the method the same number of objects although not necessarily 

of claim 2, and a phannaceutically acceptable carrier or the same objects and having for each "object the same 

exidpient therefor. The ligand may comprise chemical moi- 65 subset of attributes, 

eties of suitable identity and spatially located relative to each detecting, and recording counts o^ coinddences in each 

other so that the moieties interact with corresponding resi- sampled subset of the data set, a coinddence being the 
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co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset, where the plurality of 
attribute values is the same for eadi occurrence, the 
detecting and recording counts of coincidences in eadi 
sampled subset being performed before, at the same 
time or after sampling, detecting and recording counts 
of coincidences in other subsets, 
determining an expected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording, 
comparing, for eacb coincidence of interest, the observed 
count of coincidoices versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coinddenoe, and 
reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above 
a pre-determined threshold. 
In any of the aspects the objects may be compounds and 
the attributes may comprise particular chemical moieties. 
The objects may be peptides or proteins and the attributes 
may comprise particular stmctural or substructural patterns 
or moti£5. The objects may be selected from the group 
consisting of compounds, molecular structures, nucleotide 
sequences and amino acid sequences and the attributes may 
be features of the selected objects. Hie objects may be time 
slices and the attributes may be biological parameters of 
genes or gene products. The objects may be documents that 
are electronicdly stored and/or electronically indexed and 
the attributes may be topics. The objects may be customers 
and the attributes may comprise products purchased or not 
purchased by those customers. The attributes may further 
comprise mailings made or not made to the customers. The 
objects may comprise products and the attributes may com- 
prise customers that have or have not purdiased those 
products. The attributes may further comprise demographic 
variables of the customers. The objects may be people with 
a particular disease or disorder and the attributes may be 
potential contributing factors for the disease or disorder. The 
objects may be people with a number of different diseases or 
disorders and the attributes may be potential contributing 
factors for the diseases or disorders. The objects may 
comprise factors potentially contributing to a disease or 
disorder and the attributes may be people with or without 
those factors, in which case the method associates groups of 
people of substantially equivalent risk for the disease or 
disorder. 

The objects may be time slices and the attributes may 
comprise the state of components in a system at time slices 
prior to failure of the system, in which case the method 
associates component states that may potentially cause fail- 
ure of the system. 

In the first aspect r, may be the same for every iteration. 

In any of the aspects die method provided may further 
comprise the steps of first creating a database of transitions 
between system states, wherein a system state is represented 
by a value of a state variable, over a chosen time quantum, 
and presenting the database, in vfholQ or part, as a data set 
such that each state to state transition set corresponds to one 
of M objects and so that each state variable corre^onds to 
an attribute. 

In any of its aspects the method provided may further 
comprise the steps of first creating* a database of states and 
actions covering a chosen time quantum and presenting the 
database, in whole or part, as a data set such that eacb 
state/action/state triple corresponds to one of M objects and 
so that each state variable or action type corresponds to an 
attribute. 
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In a nineteenth aspect the invention provides a coinci- 
dence detection method for use with a data set of objects 
having a number of attributes represented in a matrix of 
objects versus attributes, the method comprising the steps 
5 of: 

sampling a subset of the matrix for a predetermined 
number of iterations, each iteration the sampled subset 
of the matrix having for each object the same subset of 
attributes; 

detecting, and recording counts of, coincidences in each 
sampled subset of the matrix, a coinddenoe being the 
co-occurrence of aphirality of attribute values in one or 
more objects in a sampled subset of the matrix, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
ddences in each sampled subset being performed 
before, at the same time or after sampling, detecting 
and recording counts of coinddences in other subsets; 
^ determining an expected count for each coincidence of 
interest, the determining being performed before, at the 
same time, or after sampling, detecting and recording; 
comparing, for each coincidence of interest, the observed 
count of coinddences versus the expected count of 
25 coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 
reporting a set of k-tuples of correlated attributes, where 
a k-ti^le of correlated attributes is a plurality of 
30 attributes for which the measure of correlation is above 
a respective pre-determined threshold. 
In the first aspect numerical correlation values may be 
reported along with the set of k-tuples of correlated 
attributes. 

BRIEF DESCRIPTION OF DRAWINGS 

For a better understanding of the present invention and to 
show more clearly how it may be carried into effect, 
reference will now be made, by way of example, to the 
40 accompanying drawings which show the preferred embodi- 
ment of the present invention and in which: 

FIG. la is a depiction of a power set of a set with N=6 
objects, arranged as a lattice imder a subset operation, 
representing all possible - K-triples of colimuis from the 
^5 power set. 

FIG. 1!> is a depiction of the relative portions of aH lattice 
nodes shown (dark squares) or omitted (light squares) by 
FIG. la. 

FIG. 2a is a depiction of n-grams for all sizes n-1,2, .... 
6 for the power set of FIG. la. 

FIG. 26 is a depiction of the relative portion of all lattice 
nodes shown or omitted in FIG. 2a with a subset of the terms 
highlighted. 

55 FIG. 3a is a depiction of all possible pairwise correlations 
for the power set of FIG. la, corresponding to analysis of the 
third tier up from the bottom of the lattice. This is a shortcut 
taken in work on inter-residue correlations in protein and 
RNA sequence families, for example. In another example, 

gQ this Figure represents the approach taken by a method that 
simply finds all pairs of sales items that tend to be purchased 
together by consumers. 

FIG. 3b illustrates the relevant correlations firom FIG. 3a 
out of the powerset of FIG. la. 

65 FIG. 4a is a depictbn of a partition of the variables of the 
objects of the power set of FIG. la. A partition is one 
particular and important kind of componential model of a 
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sequence family or other aligned dataset. In a componential 
model, a set of latent variables is found to "generate" 
or "explain" a larger set of N observable variables C;. In a 
partition model, N^^N, each Cy is generated by exactly one 
of the y^., and typically Ny<N. The obseiyables correspond- 
ing to one latent variable form a kind of chque, and 
presumably arc highly correlated with each other and rela- 
tively uncorrected with variables outside the clique. In FIG. 
4fl, the observables are fonned into three cliques: (Cj, (C2, 
C5, C^, and (C3, C4). 

HG. 4b illustrates the partition of FIG. 4a out of the 
power set of FIG. la. 

HG. 5/z is a depiction of three iterations of sampling of a 
dataset in accordance with one embodiment of the invention. 

FIG. 56 is a depiction of the three iterations of sampling 
of FIG. 5a with explanatory notes. 

FIG. 6 is a general flow diagram of a program method of 
a preferred embodiment, 

FIG. 7 is a schematic diagram of a system implementing 
the program method of FIG. 6a, 

FIG. 8 is a general flow diagram of the program method 
of FIG. 6a adapted to control a process for production of a 
product, 

FIG. 9 is a schematic diagram of a system implementing 
the adapted program method of FIG. 8. 

FIG. 10 is a general flow diagram of the program method 
of FIG. 6a adapted to generate rules for a rules based system 
that in turn produces a product, 

FIG. 11 is a schematic diagram of a system implementing 
the adapted program method of FIG. 10, 

FIG. 12 is a general flow diagram of the program method 
of FIG. 6a adapted to generate rules used to control a process 
for production of a product, 

FIG. 13 is a schematic diagram of a system implementing 
the adapted program method of FIG. 12, 

FIG. 14 is a diagram of a node of a hardware implemen- 
tation of a preferred embodiment. 

FIG. ISa is a diagram of residues for given sequences for 
the sample 3D structure of FIG. ISb where coincidence of 
sequences may indicate conserved, physical or structural 
relationships. 

FIG. ISb is a diagram of a 3D structure for a sample 
protein. 

FIG. 16 is a diagram of steps in tertiary structure predic- 
tion which can employ the methods described herein. 

MODES FOR CARRYING OUT THE 
INVENTION 

As previously set out, a base method described herein 
employs the steps of: 

representing a set of M objects in terms of a number N^ 
of variables ("attributes"), where an attribute is said to 
occur in an object if the object possesses the attribute; 

sampling a subset of r, out of the M objects, for each 
iteration among a predetermined number of iterations; 

detecting and recording coincidences among sets of k of 
the attributes in each sampled subset of objects, a 
coincidence being the co-occurrence of l^k^N^ 
attributes in the same b,. out of r.- objects in the sampled 
subset, where 0^h,Sr,; 

determining an expected count of coincidences for any set 
of k attributes and a predetermined number of iterations 
of sampling and coincidence-counting as described 



3,637 Bl 

1^ 

above, the determioing being performed before sam- 
pling and collecting, at the same time or after sampling 
and collecting; 

comparing, for any set of k attributes and nxmiber of 
5 iterations of sampling and coincidence -counting, the 
observed count versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation (or assodation, or dependence) 
for the set of k attributes; and 

reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a set of k of the N^ 
attributes which have been determined by this process 
to have a value for a chosen correlation measure above 
a predetermined threshold value. 

An alternative base method can include the following 
steps: 

sampling a subset of the data set for a predetermined 
number of iterations, each iteration the sampled subset 
of the data set having for each object the same subset 

^ of attributes; 

detecting, and recording counts of, coincidences in each 
sampled subset of the data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 

^ more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 
occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset of the data set being 
performed before, at the same time or after sampling, 

^ detecting and recording counts of coincidences in other 
subsets; 

determining an expected count for each coincidence of 
interest, the determining being perfonned before, at the 
same time, or after sampling, detecting and recording; 
35 comparing, for each coincidence of interest, the observed 
count of coincidences versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 
40 reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a plurality of 
attributes for which the measure of correlation is above 
a respective pre-determined threshold. 
The modes described herein provide extensions to the 
45 base methods described above and employ similar prin- 
ciples. The principles of one application as described herein 
may be applied to the others as appropriate. Thus, the 
description of all elements of an application will not always 
be repeated for each application. 
50 In the preferred embodiment it is preferred for simplicity 
of programming and interpretation to use a matrix where the 
objects are rows and the attributes are columns; however, 
this is not strictly required and any of the embodiments can 
utilize a data set of objects and attributes that are not 
55 represented in the form of a matrix by sampling subsets of 
the data set directly. As known to persons skilled in the art, 
any relational database can be easily transformed into a 
2-dimensional matrix fomat 
Th^ embodiments described herein lend themselves par- 
60 ticularly well to parallel processing as the steps of detecting, 
recording and counting coincidences for each of the r 
samples can be performed simultaneously across many 
different samples or other subsets of the data set. 

Eac^ of the features or variables describing an object may 
65 be numerical or qualitative. If qualitative, a feature or 
variable described in terms of some number z of levels or 
qualities may be transformed into a numerical variable with 
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z possibk values or states. A numerical variable with z 
possible values or states may be transfonned into z binary 
variables, termed attributes. A numerical variable or feature 
with a continuous range of possible values or levels may be 
trai^onned into, or represented by, a variable with z pos- 
sible values or states and therefore may also be transformed 
into, or represented by a set of z binary attributes. 

More formally, assume that we are given a database of M 
objects O^, O2, ...» each of which is characterized by 
particular values a^yEA^. for each of N discrete-valued vari- 
ables Vy. A particular value for a particular variable is 
denoted ai@Vy. One may start with continuously-valued 
variables and use any of several known methods to quantize 
them into discrete variables. We also note that, in many 
applications, the same alphabet A of possible values is used 
for all the variables. Each object might be a particular record 
in a database, or may be a sample from a random source. 
^ If the initial N variables are not binary then they can be 
converted into a set of attributes. For example, in the 
input listing attached in Appendix "B" each amino acid 
position is a variable that has 20 possibilities corresponding 
to the 20 naturally occurring amino acids represented by a 
subset of letters from the alphabet. In order to turn the 
variables into binary attributes, , each variable becomes 20 
different attributes having 1 of 2 states, such as "A" or "not 
A", "B" or not "B", and so on. An embodiment for repre- 
senting variables of this type is included in the source code 
listing in Appendix "A". Other techniques for representing 
data as attributes could be used. 

The principles set out in this description can also be 
extended to higher orders of attributes, for example trinary 
attributes to be used with higher order computing machines. 
The binary examples used herein are the simplest to imple- 
ment . — 
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each possible k-tuple of attributes. For example, for k-1, we 
have q(cy): A^-*[0, 1], and we might have for some datasct 
q(B@2)o0.33. A distribution also specifies higher-order 
probabilities, like, for example, q(B@2, F@6)«0.166. liiher- 
ent in the particular problems posed is the problem of 
estimating or approximating the distribution q( ), or at least 
parts of it. 

The problem is to find some, or all, k-tuples of columns 
(cyi, Cj2» "'fCjk)f for k=2 ... Hi, whose correlation is greater 
than some predetermined value. For example, one may want 
a procedure which, given an M-by-N table of values, returns 
a list of k-tuples of column indices (ji02> • • • » ik, such that 
D(q(v/i, v^^ . . . , vyjn^.i . . . *q(vp))>PA for some real 
number pj^ Here D(pi|p2) is the Kullback divergence 
measure, which in this case estimates the difference between 
the observed distribution of values over the column vari- 
ables versus the distribution wherein all the column vari- 
ables are statistically independent The Kullback measure is 
just one of many possible measures of correlation or asso- 
ciation applicable to this type of problem. 

For our purposes we consider correlation in terms of 
deviation firom statistical independence. One can compare 
an observed number of occurrences of some event in view- 
ing the database versus the number of expected if an 
underlying hypothesis of independent variables were true. 
That is, the problem is: Given the table of values, for all 
k-2 . . . H4, return a list of all k-tuples of attributes (a,-i@Cfi, 
ao@Ci2» - y ^flbSCifc) such that 
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This situation can be represented by a table in which each 
row stands for an object, each column stands for an attribute, 
and in which therefore each table entry a,^ stands for the fact 
of the ith object having value written at a,y for the jth 
variable. We can also write Cj (for "cohimn j") and an 
attribute as a,@Cy, 

For example, consider this small matrix of six rows 
(objects) and six coltmins (variables). 



35 



coll 


C012 


col3 


col4 


col5 


C0I6 


45 


A 


B 


C 


D 


E 


F 




w 


U 


C 


V 


E 


G 




z 


L 


c 


M 


W 


M 




V 


U 


c 


V 


A 


0 




A 


B 


c 


D 


Z 


z 


50 


w 


L 


c 


M 


E 


z 






t 




t 









55 



Object number 1 has value 'A* for variable 1, *B' for 
variable 2, *C* for variable 3, and so on. For some 
applications, it might be useful to find out that, for example, 
variables 2 and 4 are correlated. In the toy (small fictional) 
matrix example above, this correlation appears plausible, 
because whenever an object has B@2, it also has D@4; 
whenever an object has L@2, it has M@4; and whenever an 60 
object has U@2, it also has V@4. Attribute number 3 docs 
not vary— every object has the attribute C@3, and therefore 
it does not conelate in an interesting way with any other 
variable. 

Given a matrix of data, we further assume that there is 
some "true" underlying probability distribution q( ) which, 
for all order k"l, 2, . . . , specifies the probabilities for 



P(Obscrved(fla@c,i, aa@Crz> ■ ■ • 
Ci2,. . . , Cfj), Model)<e, 



for some observed behaviour of (a,i@c,i, a,-2@CQ, . . . , 
a^@c^, for some real number threshold 9,^0, 1], and some 
Model which underlies one's estimation or hypothesis test- 
ing method. 

Jhc sampling ^nhprnce^ may be random sampling, and 
if random it may be subject to any o f a number of possible ' 
probability dist nbuiions over me opjeds, mcluding a uni - 
^ lormaistribution. Similarl y, there may be constramts on the 
statistical independence or dependencies between each of 
the T samples drawn during the operation of the method, and 
between each of the r objects drawn within one sample. 
Sampfft Ai;<vflT^t|^(y^^ f^f Preferred Em bodiments . 



lere is at least one class of problems, arising in many 
diverse application areas, on which the comparative advan- 
tages of the coincidence detection method and apparatus 
described above and further to be described below are most 

apparent. Such problems are characterized by: 

1. a large number of attributes (columns, in our 
representation); 

2. the possible existence of some number of cHques of 
highly mutually correlated attributes in the dataset, 
each member attribute of each such clique being rela- 
tively uncoirelated with attributes outside its own 
clique; and 

3. lack of prior knowledge as to the precise number, width 
(k, as in k-ary correlation and kth-order feamre), and 
location of sudi attribute cliques. 

All other procedures of which we are aware either place 
prior limitations on the width k of discoverable k-tuples, or 
implement an exhaustive search, serial or parallel, over all or 
nearly all possible k-tuples of attributes. To put it more 
simply, the method of the preferred embodiment takes 
approximately the same computation time and memory to 
find a 44-ary correlation as it takes to find a 2-ary correlation 
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in the same very high dimensional dataset. Most prior Alamos group could only discover pairwise covariation, we 

methods, in contrast, either rule out the discovery of the describe herein k-ary residue covariation, where k>2. That 

44th-order feature or else require the allocation of orders of is, we have identified previously unrecognized moti& of 

magnitude more time or space in order to find it. HIV envelope protein. 

Sample Applications of Preferred Embodiments 5 For a particular trial, input consisted of the respective 

Modellers of very large data sets arc thwarted in their amino acid sequences ofV3 regions from 657 different virus 

attempts to compute very far into a fully higher-order isolates, and is shown in Appendix "B". Source code used on 

probabilistic model by both the computational complexity of the input is shown in Appendices "A" and "D", named "File 

the task and by the lack of data Deeded to support statisti- coinc.pl" and "File probsort.pl", respectively. Output is 

caUy significant estimates of most of the higher-order terms, lo shown in Appendix "C*. 

The preferred embodiment computes only a subset of Referring to Tables C.l through C.9 set out elsewhere 

higher-order probabifities, and extracts a limited selection of below, the results of 6 separate trials are shown. Parameter 

higher-order feature ("HOFs") for construction of a database values are as indicated in the respective legends. In each 

model. Efl&dent use can be made of Umited computing Table, the results arc ordered by statistical significance, with 

resources by pre-selecting sets of higher-order features using 15 the most significant correlation first, and the standard one- 

the conclation-detection mediods described herein, and letter amino acid code is employed. Thus, referring to Table 

building the most significant (statistically and in terms of C,6, the most significant coincidence observed is the occur- 

application-specific aiteria) into model-based classifiers rence of alanine (A) at residue 18, glutamine (Q) at residue 

and predictors based on existing statistical, rule-based, neu- 31, and histidine (H) at residue 33. This, like other coinci- 

ral network, or grammar-based methods. The pre-selected 20 dences set forth on the cited pages, represents the identifi- 

sets of HOFs can be used to create rules for sudi systems. cation of a structural motif of the HIV-1 V3 loop which 

For example, a data set may be analysed using the methods comprises these residues. 

set out herein to determine that if a company is filing a patent Continuing with the particular example of A18/Q31/H33, 

application then it should file an assignment from the the V3 structural motif comprising these residues presum- 

inventor. This rule is then used in the system to generate 25 ably exists on the exterior of the virus particle, and that 

assignments whenever it is determined that a company is region of the V3 loop likely performs a specific function 

filing a patent application. Many rule-based networks could which requires the particular structural motif. Thus, the 

benefit from pre-processing using the methods described stmctural motif would have to be conserved after mutation 

herein, see for example, the System and Method for Build- (s) to preserve that function. This reasoning is extended to 

ing a Computer- Based Rete Pattern Matching Network of 30, other coincidences identified herein, 

Grady et al. described in U.S. Pat. No. 5,159,662 issued Oct. The identification of a particular conserved structural 

27, 1992; the inference engine of Highland et al. described motif of HIV has several uses. 

in U.S. Pat. No. 5,119,470 issued Jun. 2, 1992; and the Fast Using techniques known in the art, a peptide embodying 

Method for a Bidirectional Inference of Masui et al. the motif could be produced for use as an antigen, 

described in U.S. Pat. No. 5,179,632 issued Jan. 12, 1993. 35 Accordingly, a vaccine could be prepared. The peptide 

The discovered HOFs can altematively be used directiy to embodying the motif might be made using known recom- 

create products, for example, in the prediction or determi- binant methods, as are described generally, for example, in 

nation of protein structure, when fed into existing methods Maniatis et al., Molecular Cloning: A Laboratory Manual, 

based on distance geometry or empirically-estimated pat- Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. 

terns of cooperativity and folding, or in marketing schemes 40 (1982) and in Sambrook el al., Molecular Cloning: ALabo- 

based on correlated product sales informatioa ratory Manual (2"'' Edition), Cold Spring Harbor 

Later below, practice of the principles described herein Laboratory, Cold Spring Harbor, N.Y. (1989). Altematively, 

using the Los Alamos HIV Database is described. In the peptide or a peptidomimetic might be chemically syn- 

particular, the principles were applied to study of the V3 thesized using standard chemical techniques. Monoclonal 

loopof envelope proteins of human immunodeficiency viius 45 antibodies to the peptide or peptidomimetic could be gen- 

(HIV). In biochemistry and molecular biology in general, erated using standard methods, as described for example, in 

covariation of particular residues of a protein likely indicates Harlow, E and Lane, D., Antibodies: A Laboratory Manual, 

the existence of a structural motif characterizing a region of Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. 

the protein that has a functional, physiological role. (1988). Fragments of such monoclonal antibodies, for 

Envelope proteins are partially embedded in the lipid 50 example, F^ fragments, that have specific a£5nity for the 

membrane surrounding a virus particle, and project exter- novel structural motif could also be generated, 

nally from the lipid. When the lq)id of an HIV particle fuses In another embodiment, a ligand that interacts with a 

with the membrane of a host cell during infection, envelope structural motif identified according to the invention could 

proteins may also protrude from the membrane of the be generated. That is, the ligand would be characterized by 

infected cell. The V in V3 stands for ''variable", as the 55 having chemical moieties of suitable identity and spatially 

sequence of the V3 loop is highly variable between different located relative to each other so that the moieties interact 

virus isolates. with conesponding residues or portions of the motif. In 

Previously, a Los Alamos group in B. T. M. Korber, R. M. some embodiments, the ligand could be an agent, eg. a dmg) 

Farber, D. H. AVolpert and A. S. Lapedes, ''Covariations in that, by binding to the motif; interferes with function of the 

the V3 loop of HlV-1: An information-theoretic analysis**, 60 region. The ligand would therefore be an HIV antagonist 

Proc. Nat. Acad. Sd. U.S.A. 90 (1993), the disclosure of with potential therapeutic utility. Altematively, the ligand 

which is hereby incorporated herein by reference, described could bind to the particular V3 region comprising the 

2-ary covariation mutations in certain residues of the V3 identified motif, providing diagnostic utility. Such diagnos- 

loop of HIV 1 envelope proteins. Practice of the present tic utility can be ex vivo. A ligand with diagnostic utility 

principles has confirmed some of the Los Alamos group's 65 (e.g., an antibody) nught comprise a label, such as a fiuor or 

results, but has further permitted the discovery of other an enzyme conjugate for use in a colorimetric reaction, 

highly covarying groups of residues. Whereas the Los Fluorescence-labelled viruses or virus-infected cells could 
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be visualized or counted using fluorescence microscopy of parameter space if preceded by a fast preprocessing 

FACS (fluorescence -activated cell sorting). procedure, such as one provided by implementing the 

Methods of designing md identifying ligands that bind to principles described herein, that found plausibly cor- 

structural motife identified according to the invention are related variables in the database, 

also provided by the invention. 5 2. \^ual exploration of large complex data sets: If 

Thus, in one embodiment, the invention provides a ligand coupled to even a simple graphical display interface, a 

for binding with an envelope protein of human immunode- procedure such as ours permits a user to view quickly 

ficiency virus (HIV), wherein the envelope protein includes (with smaU ntimber of r-samples) the most plausibly 

a structural motif comprising amino acid residues A18/Q31/ interesting higher-order features in high-dimensional 

H33. The ligand includes at least one functional group lo data. 

capable of binding to the motif. In a preferred embodiment, 3. Pre-conditioning and redundancy elimination: Thus far, 
the ligand includes at least one functional group capable of we have stressed the utility of finding inter-attribute 
binding to and being present in an effective position in said correlations in order to use them in the building of 
ligand to bind to residue 18, at least one functional group models; but in many optimization, learning and data- 
capable of binding to and being present in an effective 15 fitting applications, one requires that correlations 
position in said ligand to bind to residue 31, and at least one between variables be foimd and eliminated, through 
functional group capable of binding lo and being present in any of a number of subspace methods like principal 
an effective position in said ligand to bind to residue 33. cg rnponents analysis (PCA). 

In another embodiment, the invention provides a method ""^^MBodiment Using a Programmable Digital Computer ( 

of designing a ligand to bind with a structural motif of an 20 Components for Digital Computer Embodiment 

envelope protein of human immunodeficiency virus (HIV). Data Ma trix. Sampling. and Coincidences . Given a set o f ^ 

The method includes providing a template having spatial Ml)bjects, eadi of which has either a ^Tes" (represcntable 

coordinates of A18, Q31 and H33 in the V3 loop of HTV-l by 1) or ''No" (repres^table by 0) value for each of a fixed 

envelope protein, and computationally evolving a chemical set of attributes^ the input dataset can be arranged into an 

ligand using an effective a^orithm with spatial constraints, 25 M-by-N^ table of values, which we shall caU the data matrix 

so that said evolved ligand includes at least one effective or simply ma trix, and this matrix, as well as its sub-matrice s 

functional group that binds to the motif. In a preferred and ' related vectors that e mprise junctiona l parts of the 

embodiment, the ligand includes at least one functional s^!^^cess described Below, are stored in memory loca- 

group capable of binding to and being present in an effective tions within a programmable computer. In this representa- 

position in said ligand to bind to residue 18, at least one 30 tion the rows of the matrix conespond to objects, and the 

functional group capable of binding to and being present in columns correspond to attributes. The matrix may be 

an effective position in said ligand to bind to residue 31, and labelled as V,y and each element of this two-dimensional 

at least one fiinctional group capable of binding to and being table labelled by v,y€{0, 1}, where i refers to the ith object 

present in an effective position in said ligand to bind to (row) O; and j refers to the jth attribute (column) a^. The set 

residue 33. 35 of objects may be listed, for the piuposes of this description, 

In another embodiment, the invention provides a method as O-o^, o^, . . . , o^ and the set of attributes may be listed 

of identifying a ligand to bind with a structural motif of an as A-a,, a^, . . . , a^^. 

envelope proteiu of human immunodeficiency virus (HIV). "■ flu. ii^ illustrates these terms as applied to the example 

The method includes: providing a template having spatial illustrated in FIG. Sa discussed in more detail below with 

coordinates of A18, Q31 and H33 in the V3 loop of HIV-1 40 regard to the program method description of a preferred 

envelope protein; providing a data base containing structure embodiment. 

and orientation of molecules; and screening said molecules A particular attribute a,, may be said to occur in a par- 
te determine if they contain effective moieties spaced rela- ticular object (row) i if a,y=l. 

tive to each other so that the moieties interact with the motif. Given an ordered list of 1 ^m^M objects (rows) 5, an 

In a preferred embodiment, a first moiety of the molecule 45 incidence vector 2 for an attribute a^ may be defined as the 

interacts with residue 31, a second moiety of the molecule binary vector or string of length m such that the gth bit is 1 

interacts with residue 31 and a third moiety of the molecule if and only if the attribute a^. occurs in the gth object in the 

interacts with residue 33. given list of objects. The mcidence vector 2 is a simple 

The principles described herein encompass similar representation of the pattern of occurrence of the attribute 

respective embodiments, including antigens and vaccines, 50 over some set of objects, for example, the set of all M objects 

for the other covarying k-tuples described herein, that is, or the set of objects corresponding to one r-sample as 

both residues of the V3 loop that covaiy, and particular d escribed b^ sss^x-^ 

amino adds at certain residues that covary. '^An r-sample, for example the three rows identified by 

The method of the current invention can be viewed as a reference numeral 4 in FIG. 5A, is a set of r of the M records 

''high-pass filter'' for detection of higher-order features. 55 drawn randomly from some probability distribution. In some 

Such HOFs play an important role in database modelling, preferred embodiments, the rows within sample are consid- 

madiine learning, and perception and pattern-recognition. In ered to be drawn independently finom a uniform distribution, 

database mining and modelling contexts, a procedure for ^ The drawing of an r-samples sample 4 is performed by the 

discoveryof these features might serve any of several major system one time within each of a specified number of 

roles, including: 60 iterations. In some preferred embodiments, the samples 

1. Preprocessing of large, complex datasets: Many of the drawn over the total number T or iterations are considered 

best modelling methods, including Gibbs models, Hid- to be drawn independently from a uniform distribution, 

den Markov Models and EM, MacKay's density In some preferred embodiments, different values of r arc 

networks; and related factorial learning methods fi:om used for different sequential iterations of the sampling, 

the neural network community, could be helped sig- 65 and/or for different subsets of the dataset processed by 

nificantly in capturing higher-order interactions with- different processing nodes in a parallel computing embodi- 

out exhaustive search or combinatorial explosion of ment. In such cases, we may say that on the ith iteration of 
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in the ith sample, the number of objects sampled is r^. Some In one particular embodiment of the invention, the func- 
advantagcs of using different sample sizes include: the tion f„y,tch («, h, r) is obtained from the multinomial distri- 
abiHty to try, within one tun-through of the method, different bution: 
values of r when one is unsure which values of r are best; and 

the ability to pick different values of r for different process- 5 / n \ 

ing nodes in a parallel computing embodiment, in order to 0 = ( ^^. ^ J M/ aa)*p(a.7. ■• . 

make optimal use of different processor sizes/speeds and 
memory sizes among the different processing nodes. An 

tT^^o^^^T^£^i^T^'&i'o] fo-n'-a ^ves >n esUn,ate of Uje probabiHty for 

the program code. finding exactly h occurrences of a^, h occiirrences of 

A coincident set, or cset, may be defined as a pattern a^^, , . . , and h occurrences of a^rj, all occurring in the same 
comprising the joint appearance of l^k^N^ attiibutes b rows, in one r-sample. 

(columns) 1 within some set of objects (rows) 5. That is, /.r^- n *• j ^ l -if u « u * 

given some one or more rows 5 under consideration, there (Tins funcUon definition has a smiple form because all but 

is a cset a^.^, a^-^ 5* if a^j, a.^, . . . and a.;^ all occur in 15 two of the large number of p( ) factors in the standard 

the given row or rows. For example, elements A@cl, B@c2, multinomial expression vanish with zero exponents .) 
D@)c4 identified by reference numeral 3 in FIG. 5/? are a t_ l -i -l. r * u r • u ^ *u i ** u * 

ooinddenc^ (cset) probability of a match of size h for the k attributes 

^ Wthi^^e OT^utcr memory is stored a data structure which make up a potential cset has been defined in temis of 
termed the cset table, whidi is a means for storing the 20 the joint probability pCa^^, . . . , aa); the Expected Count 
identity and occurrence coimt for each cset that occurs in one Functioa must employ particular estimates for these joint 
or more iterations within the process. The identity of a cset probabilities. In this preferred embodiment, the joint prob- 
is a list of attributes (columns) comprising the cset; the abiKty estimates inooiporale the hypothesis of independence 
occurrence count is a number corresponding to the number ^^^^^ individual attributes. THerefore in the definition 
of occurrences of a cset that have been observed up to a 25 o , . , i_ » 

particular iteration within the process, or at the end of all the formula given above we substitute 
iterations. In some preferred embodiments, the cset table is 

implemented as a hash table stored in a computer memory. . ^ rr /. / xx /- - x 

A cset has, for a given r-sample, a particular incidence U ''^^'^ ''^^^ ^ U " ''^^"^^ 

vector, which is its binary-encoded record of occurrences 
(denoted by *1') non-occurrences (*0') over the r data items 

in the sample. Therefore a cset, corresponding to a set of k • try ix-^x a 

attributes, may have an associated incidence vector; and an Hypothesis Test Function and Correlation Measure. An 
individual attribute may have an associated incidence vector. hypothesis test is a mathematical procedure, implemented as 

A match (or coincidence) of size h is said to occur, in a a computer program or subroutine, or in special purpose 
given r-sample, for a given cset a=(a^i, . . . , a^^^)? when a^^ electronic and/or optical hardware, whidi takes a pair of 
appears in h out of the r records, . . . , and a,-jt. appears in h number H^^^ and H^fc,, representing the expected and 
out of the r records, and they all appear in exactly the same observed numbers of coincidences, respectively, for a par- 
h out of r records (See HG. 5^). ^^^^ of ^ attributes, and produces a number C repre- 

Observed Counts of Coincidence, Hie comcidences are renting an estimate of the correlation among the k attributes, 
observed, and the corresponding csets stored or updated, by 40 

means of a binning method. In each iteration, the attributed In some preferred embodiments, a Chemoff bound on tail 
are binned, that is, places into separate subsets according to probabilities provides the hypothesis test function, as 
their incidence vectors 2 over the r-sample 4 for the cunent described below. 

iteraaon. In this described matrix-based embodiment of the Let random variabfcX, hold the value h, for each itetatioo 
mvention, these vectors act like-r-bit addresses mto a very 45 . 
sparse subset of 2' address space. (See FIGS. Sa and Sb). ^ ^ 

All the attributes in one bin constitute a cset. The cset is ^ 
recorded: if the particular cset has occurred in a previous x^V Xu 

iteration, then its count of occunences is updated; if it has ^ ' 

not occurred previously, then an entry in the cset table is 50 
created for it, and then its occurrence count is updated. In 

this described embodiment, the system stores the number and note that OiX^T r. The method of Chemoff-Hoeffding 

h:Oih^r of occurrences for this and each iteration. After a bounds [8] provides the following theorem: 

specified number T of iterations has been completed, the cset , ^ .t^,^!.,^.. ^• 

table conuins a list of aU the csets observed, and, for each 55 . l^t random vanableX, hold the value h, for each iteraUon 

cset a, a total number of observed coincidences, which ^» 

conesponds to X^^Jti^a), where h^a) is the number of joint 

occurrences for tlie k at&ibutes comprising a, for the ith x-Y X 

iteration. " 

Expected Count Function. An expected count function is 60 
a mathematical function, implemented as a computer pro- 
gram or subroutine, or in electronic or optical circuits, which ^^te that O^X^T r. The method of Chemoff-Hoeffding 
takes a set of attributes a^, a^ . . . , a,^ and a number T and j-gj following theorem: 
produces a number corresponding to an expected number 01 

coincidences for that set of attributes in a process of T 65 Let X-Xi+X2+ . . . +X„ be the sum of n independent 
iterations of drawing of r-samples and observing coind- random variable s, where l,-^X,-^u,- for reals 1,- ("lower") and 
dences. u,- (**upper"). 
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XheD Aftei some number of sampling iterations has been 

perfonned, the reporting of sets of correlated attributes may 
/ \ (1) be performed for some or all of the recorded coincident sets 

rye i\x\ D J] ^ c3cJ "^^ determined, in the comparisons, to signal 

I' 5 significant correlations between the component attributes. 

' This may be done for all csets at once, or for any subsets of 

them at different points throughout the process. These com- 
For our purposes, we set n^T and 1,^ and u,«r,- for all i^l , parisons for different csets may be performed sequentially or 
2, . . . , T, and we thereby obtain in parallel, or in some combination thereof. 

Program Method Description of a Preferred Embodiment 
/ \ (2) ^ Below is shown, in pseudocode, a program on appropriate 

W - EiX\ > d] s cxJ I media, for example, a floppy didt, hard drive, RAM or other 

^ Z/f j such media, corresponding to one possible embodiment on 

a programmable digital computer. 
FIG. 5fl provides a picloriaJ example of the application of 
Using this mathematical relationship, an effective prooe- this embodiment to a fictional toy dataset. Three iterations of 
dure for computing a correlation value can be defined: r-sampling (for r=3) on the toy dataset are depicted, top to 

bottom. For each iteration, the left-hand box represents the 
dataset, with outlined entries representing the sampled rows. 
The right-hand-box represents the set of bins into which the 
20 attributes collide. For example, in the first iteration, A@l, 
B@2, and D@4 all occur in the first and second of the three 
sampled rows, so they each have incidence vector 110 and 
In the special case wherein the same sample size r is used collide in the bin labeled by that binary address. Bins 
for every iteration of the sampling, that is, when r^^r for all containing only a single attribute are ignored; and "empty" 
i-l, 2, . . . , T, then the above formulas reduce to the simpler 25 bins are never created at all. All bins are cleared an^ 
forms: removed after each iteration, but collisions are recorded in 

the Csets global data structure, 
/ 2^2\ „ V Procedure to find correlated sets of attributes: 

T\x-^x\>s\^cA — \ ^-^^ 0. begin 

^ ^ ^ 1. read (MATRIX); 

- 1 ^Jz]£i±ll!^\ 2- read (R, T); 

^ ^ ■ ™ T-f^ I 3. compute Jrst_order_jnaiginals{MArRIX); 

4. csets:-{}; ' 

Here the correlation value corresponds to an estimate of \' ^^LilffT^ y^tn kit atdiv\. 

1 minus the probability of having observed H„,, 35 6. sampled_r()«^:-«samp^ 

coincidences, over T iterations of r-sampling. if the hypot l atobutes:-get_attnbutes(«mp ed tow^^^^ 

eses underlying the expected count hI, w;re true. If the «• alL_comcidenceK-ftnd_alLcoiocidenc^^^ 

assumption of independence between thrattributes was used f • f?[ «»n«denc« in all comcideiices do 

to compute L described above for some preferred JO. if cset al«ady eHfs(»incidence csets) 

embodiments, tb'en this hypothesis test provides a correla- 40 SL" K^'SS^ricr^U- 

tion value for each cset that estimates the deviation firom a(i<i-new_csencomcKience, csets;, 

independence; that is, it estimates the statistical dependence J^' 

between the attributes making up the cset. ' 

Operation of the Components Within a Process • . j 

IVpically, the representation component is performed first 45 ^'"^ ^ ^ ^ . . ./a 

withta the overall process of the clirrent invention. A ph.- ^. expected:=compute_expected^atch_^counKcset); 

rality of sampling iterations is performed on the represen- «• observed: =get obsen/ed^atch_count^cset); 

tation of the data and for each r-sample, the detection and 19. stats|-update_stats(cset. hypoth_test(expectcd, 

recording of coincidences is performed. Hie sampling itera- ° 

tions may be performed sequentially or in parallel, or in so ' . . ^ . ^ ^ . 

some combination of sequential and parallel steps. prmt^BaUstats(cseU, stats); 

At any stage within the process, the determining of an c u ^ a a « *u ♦ 

expected coiit of coincidences, for some or all of the Steps 5 through 21 of the pseu^ 

coincident sets of attributes, is performed. THIS component of the base method descnbcd herem, namely: , 

of the process may be performed all at once for all coinci- 55 samplmg a subset of the mattix for a predetermmed 

dent sete, or incrementally; sequentially or in parallel, or in ^^"^^er of iteraUons, each subset of attributes being the 

some combination. It may be performed for coincident sets same, 

(csets) as each coincidence is detected or stored, or may be detecting and recording counts of coincidences of 

performed before or after such detection or recording. attributes in each sampled subset, a coincidence being 

After some number of sampling iterations has been 60 theoocurrenceof a phirality of attributes in an object in 

performed, the comparing of actual to expected number of a sampled subset, where the plurality of attributes is the 

coincidences may be performed for some or all recorded same for each occurrence, 

coincident sets. This may be done for all csets at once, or for determining an e^qpected count for each coincidence of 

any subsets of them at different points throughout the interest, the determining being perfonned before, at the 

process. These comparisons for different csets may be 65 same time, or after sampling, detecting and recording, 

performed sequentially or in parallel, or in some combina- comparing, for each coincidence of interest, the observed 

tion thereof. count of coincidences versus the expected count of 
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coincidences, and from this comparison detecmining a pseudo-random or random, any of a oiunber of random 

measure of conelation for the plurality of attributes for sampling schemes may be used, including hypergeometric 

the coincidence, and and multinomial sampling. The r objects within an r-sample 

reporting a set of k-UipIes of correlated attributes, where may be sampled "with replacement" or "without replace- 

a k-tuplc of correlated attributes is a plurality of 5 ment". At the next level up, the set of r samples themselves 

attributes for which the measure of correlation is above may be drawn "with replacement" or "without replace- 

a pre-determined threshold. ment". 

Appendix "B'* contains actual soiirce code written in the Different choices for the key sampling parameter r are 

Perl language for running on a Sun4 computer in the Sun possible, and it is not necessaiy to use the same munber r for 

UNIX operating system. Sample input data for the code lo each sample. 

listing in Appendix "B" is listed in Appendix "C for partial Many possible choices exist for T, the number of sam- 

amino acid sequences from V3 loop of HIV envelope pling iterations. It is possible to use any of a number of 

proteins. The corresponding output from the code of Appen- mathematical methods for choosing T in order to achieve a 

dix "B" for the input of Appendbc "C is shown in Appendix desired confidence level in the degrees of correlation esti- 

"D". In order to produce the ou^ut of Appendix "D", the 15 matedforthek-tuplesof attributes discovered by the method 

adjunct Perl language program listed in Appendix "E" was of the current invention. Alternatively, it is possible lo run 

used for clarification and presentation from the main code the procedure for a given fixed number of iterations and then 

listing in Appendix "B". A general flow diagram for this print or view the results, or to interleave the running of some 

embodiment is shown in FIG. 6, while a general block number of iterations with the printing or viewing of partial 

diagram is shown in FIG. 7. The resulting report was stored 20 results. 

in a flat file as a relatively imstructured asdi database, which Many possible ways exist for the representation, storage, 

was later printed; it could equally well have been sent to a and accessing of the Csets data structure used during the 

printer directly or sent across a network for report to other processing of the algorithm. The Csets data may be stored 

resources. and accessed via a hash table, a k-d tree, patricia tree (also 

Alternative Embodiments 25 called a trie), and/or in other ways, known to those skilled 

Descriptions of altemative embodiments of the present in the art, of storing and accessing data efficiently. Whatever 

invention may be divided into two categories, described data structure is chosen, the structure may be stored physi- 

separately below: first, different physical embodiments of cally in registers, in main memory, and/or on secondary or 

the system/process as may be used in many potential external storage media such as magnetic disks, magnetic 

problem-specific appUcations; and, second, different inter- 30 tape, or optical storage media. 

pretations of the components enumerated in the description Altemative to the embodiments of the method on general- 
above, according to different problem-specific applications purpose computing hardware of various types, there are 
of the present invention. many possible embodiments on special-purpose electronic, 
Different Implementations optical, or electro -optical hardware, or some combination of 

For example, among the many possible embodiments as 35 general-purpose and special-purpose architectures and 

programs on programmable digital computers: devices. 

The method may be run entirely sequentially, as in the For example, vcrv efficient special purpose electronic 
most straightforward interpretation of the pseudocode given (jJSl or VLSI) may be used to implement the matrix re pre- 
above, or the method may be run on parallel (vector or sentation of the current invention, by the fact that th e 
multiprocessor) or distributed computer systems in many 40 incidence vectors of attributes are simple binary veaors, b y 
possible ways. A set of computations may be run in paraUel, t he fact that the coincidence "bins^^ described earlier in o ne 
in which each computation performs the entire program v iew of the current invention, correspond to "addresses" t o 
steps outlined above, but with each separate computation a jnemory space of size 2"^ for each r-sample, and by the 
using a different value for r, the sample size; or eadi separate a bility with current technology to design, fabricate and u se 
computation could run the same program steps with same 45 special-purpose hardware for implementations of random - 
key parameter values, but start with different initial random number^eneration and samplinfi, fast-access storage of the 
number seeds for the random r-sampling. Alternatively, the Csets ^ta stmcturcs, and of the mathematical runctio ds 
entire program steps outlined above could be nm once, but used in the calculation of expected co unt cRtipj^^i^^ and 
each different r-sample could be forked off into a separate hy)othesis tests and correlation estjiyiates. 
process nm on different processors, where in each such 50 Special Purpose Hardware Method Description of a Pre- 
process would comprise the detection and optionally record- ferred Embodiment 
ing steps, with the global cset coimts later joined into the 1. Overview 

global process and g^Iobal data structures. AdditionaUy, the Referring now to FIG. 14, an embodiment of special 

computation of the expected counts, and the comparisons of puipose hardware mentioned previously is intended to 

expected with observed counts, could be performed all at 55 exploit the potentialbenefltsofparallelizing the execution of 

once or incrementally, sequentially or in paralleL Similariy, the algorithm. A node (defined below) divides a given data 

the reporting of the estimated correlation values can be set along M (the number of rows of data) and distributes 

performed for some or aU of the Ctets, once at the end of these portions to its CPs (also defined below). The CPs may 

computation or incrementally throughout, in serial or par- be either other nodes (in a recursive definition) or may be 

afl el. ^— ^ special purpose processors developed to perform step 8 in 

•""Tte output of the method, which can include the reporting the method as described in high-level "pseudo-code" in the 

of the significantly correlated k-tuples of attributes (the csets previous Program Method Description of a Preferred 

that are deemed sufficiently highly correlated in the Embodiment Section. When the results have been computed 

comparing, aJc.a., hypothesis testing stage), can be verbal, by the node's CPs, the merging step (steps 9 through 14 in 

and/or numerical and/or graphical. 65 the above-noted "pseudo-code" description) is performed by 

A number of sampling schemes are possible, including the node. Once the merging has been done, the results are 

deterministic, pseudo-random, or purdy random. And if passed back to the node's parent. If the node is the root of 
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the tree, the complete results set is sent back to the ddver 
that controls this hardware. The system described below can 
be used "off-line'' &om a main computer's CPU; among 
other possibilities for commercial marketing and use of such 
a system is its implementation on a special ^oard" or "card'' 
that a user can purchase and install on his or her personal 
computer or workstation. One can also envision the use of 
one or a number of such special subsystems on a local area 
network or a ''supercomputer" installation. The described 
embodiment represents only one of many possible ways, as 
will be understood by those skilled in the art, to parallelize 
the methods describeid herein. 

This implementation described below is assumed to act 
solely on character-valued data attributes. This is in no way 
a limitation of the basic methods described herein, rather it 
is a specific implementation of the basic methods. The 
implementation could easily follow a binary-attribute 
encoding as described elsewhere herein. 

A diagram of a node is shown in FIG. 14 with compute 
processors (CPc). The node includes the following: 
A bank of memory where input to be sent to the CPs is 
stored (the input buffer) and where results found by the 
CPs will be stored (the output buffer). 
A memory bus divided into control, data and address 
buses used to arbitrate communication on the bus itself 
as well as being the vehicle for data transfer. 
Aset of bit flags and a small additional portion of memory 
(LastOut). LastOut is the address of the section in the 
output buffer that was last written to. The two bit flags 
are used by the merge and I/O processois to determine 
what state they each are in. 
An array of size J of compute processors (CPs), each with 
their own local memory caches, which perform the 
discovery of coincidences. 
A merge processor (MG) whidi has its own cache of 35 
memory in whidi it writes the merged results of the 
CPs. 

An input/output processor (lO) whose main responsibility 

is to control use of the bus. 
A clock which is used to ensure that each element in the 40 
system mns synchronously with respect to every other 
element. Execution of each of the parts in the system 

can be thought of as running i,p Inck-step. _ 

Computer processors are defined as being either special 
processors that perform the R-sampling step of the algorithm 45 
(step 8 in the pseudo-code description and graphically in 
FIG. 5a. Hiis allows the possibility of a tree structure of 
such nodes rather than limiting embodiments solely to a 
vector arrangement. For any particular choice of hardware 
for the memory bus, it may be the case that there is a 
maximally useM limit on the number of CPs per node. A 
tree structure allows a way around this limit. 

Hie implementation assumes that maximal values of 
method parameters R and N (Rmax and Nmax) are specified 
a priori. It is the responsibility of the software driver to 55 
detect when these limits have been violated are react accord- 
ingly. 
2. Bank of Memory 

For each node, memory of size 2*J*Amax*Rmax*Nmax, 
where Amax is the maximal total number of iterations that 60 
can be done in the node. This memory is divided equally into 
the input and output buffers. Note that the size of the input 
for a single iteration is no greater than PRmax^Nmax and 
neither the locally-produced results nor the final merged 
results (formed by combining the partial results from the J 65 
CPs) can exceed this limit, so there is no risk of exceeding 
available memory. 



50 



Access to this memory is as follows: 

10 has write access to the input buffer and read access to 

the output buffer. 
MG has no access to the input buffer and read access to 

the output buffer. 
CP has read access to the input buffer and write access to 

the output buffer. 

3. Memory Bus 

Control of the memory bus is the responsibility of the 10 
processor. Each CP is assigned a numeric identifier (0 to J+1 
as 10 is implicitly assigned zero and MG is assigned 1). The 
memory bus is divided into three sections: 
Control: TWo wires for each CP, two for MG and two for 
10 comprise the control bus. The first of each pair is 
called the request wire while the second is known as the 
response wire. 
Address: Each device in the system is assigned a unique 
memory address range. The address btis, tised in com- 
bination with the data bus, determine what device the 
current value on the data bus will be written to and, if 
applicable, where within that device it will be stored. 
The width of the address bus (i.e. the number of wires 
in it) is determined for a choice of size for the memory 
storage of input and output and thus will not be 
specified here. 
Data: Given the assumption that only character-valued 
data attributes will be handled by this system, the data 
bus is eight wires wide. 
Bus arbitration is handled through the use of the control 
bus. When a device (here meaning MG, 10 or one of the 
CPs) wishes to use the bus, it asserts a logical 1 on its request 
wire. On any given cycle, more than one device may have 
done so. 10, when it returns to its bus arbitration duties, 
simply sets the lowest numbered device's response wire to 
1 and zeroes all the other response wires. This tells the 
lowest identified device that it has permission to use the bus 
(reads and writes are not indicated — lO is responsible for 
establishing the context) and all others that they must wait. 
All devices that wish to use the bus continue to assert 1 on 
their request wire until given permission. When the permit- 
ted device has finished with the bus, the device asserts 0 on 
its request wire, indicating to 10 that it may reassign the bus 
to another device. "Handshake" and other types of protocols, 
such as described above, are well-known to and understood 
by those skilled in the art. 

4. Bit Flags and Additional Memory 

The additional memory is used by 10 to store the last 
written output section. There is no need to store a list of such 
sections for MG because "writers to the output buffer are 
done incrementally and MG can determine how many 
unused sections it has waiting by comparing its last read 
index with the last written index. Only 10 can write to this 
memory and only'MG may read from it. 

TWo bit flags are used to indicate "10 finished" (meaning 
10 has sent all data out and received all CP output) and 
'*Merge finished". 

5. An Array of J Compute Processors 

As noted above, these are either nodes or are special 
piupose processors that compute one R-sampling step in the 
algorithmic description of the general method of the current 
mvention. In the latter case, they may comprise: 

a processor which performs the coincidence detection in 
addition to the fhnctions listed below 
2*Nmax*Rmax sized local memory 
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The memory is split into two equal portions for input and outlined earlier. Initially^ 10 sets CI and C2 to zero and 

output. zeroes its bit vector (indicating that it has sent no data to any 

Initially, a CP asserts 1 on its request wire, indicating that CP) and waits for the software driver to start sending it data, 

it is ready for data. When it sees only its response wire set During this time, it knows that no work can be done, and 

to one on the following cycle, it expects to be sent the 5 thus zeroes all permissions for the bus. An interrupt signals 

current values for R and N and then the data itself the arrival of data from the driver and 10 continues to zero 

(otheiwise, it waits for this to be the case). Based on the first all communication requests until all the data has been 

two values, it can determine when the current input is writt«i ^ ^^e input buffer. Hie incommg data is of fomi: 

exhausted. It then asserts 0 on its request wire and performs ^ 

the binning and coincidence detection steps of the method, lo R 

When these steps have been completed the CP asserts logical T, the total number of row sets of size R sent data stream 

1 again on its request wire, this time indicating its desire to of size TRN 

send its results. When given permission to use the bus, it 10 can thus determine when no more data can be expected, 

sends its coincidence set to 10. 10 is responsible for man- it is the responsibility of the driver to: 

aging the location for storage of this data. The output stream 15 ^^ide data mining requests mto sizes no greater than 

of the CP comprises a tally of the coincidences found Amax 

foUowed by the coincidences (csets) themselves. The coin- ensure that the number of rows sent as input is evenly 

ddencesareoftheform: divisible by R 

hit count (no higher than Rmax) that Rmax and Nmax have not bee exceeded by the 

• . • -j^i. r *u * ' *i. u t Current data set merge all results sent back from the 

size (that is, the width of the cset, i.e., the number of 20 ^levice 

component attributes) q^^^ .^p^^ ^^^^ ^^^^^^^ Iq ^^^^ ^^^^ ^^^e R*N 

a size-long list of the attributes of the coincidence in form e^ch CP^ by first setting the ilh bit in the vector lo one (this 

(value, position) indicates that 10 should expect output fi:om CP^, signaling 

When all data has been sent to 10, the CP asserts 1 on its that CP by setting its response wire to 1 while zeroing all 

wire to request more data. 25 others, sending the data onto the bus and finally increment- 

6. Merge Processor MG ing C2. 

Hie merge processor may comprise: — WE^all CPs are busy (or all available input has been 

a processor that runs the merging step exhausted), 10 waits for a CP to assert 1 on its request wire 

NmaxRmax local memory used to store the output from ^hich indicates that it is ready to send back results. Once 

one CP ^ signal has been received from a CP, lO rctnevcs the 

^ /.u r * i *v, i * * ♦ results from the CP, stores them in the output section 

counters CI and CI (the foriner tracks the last output ^^^^^^ ^^^^ associated with that 

section read by MG; the latter counts the number of increments CI and asserts 1 on the MG request wire. If 

coincidences currendy stored m the merge buffer) ^ ^^^^ ^ ^^^^ ^^^^^^ Iq ^^^^ 

memory used to store the current value of A 3^ available R*N set to the CP who just reUimed results (setting 

memory of size JNmaxRmaxAmax used to store the ^it for that CP to one). When C2 equals T and the bit 

merged results vector contains no bits set to 1, then 10 knows that it is 

Initially, MG sets its counters to zero and its request wire to finished and sets the 10 bit flag to 1. At this point, 10 goes 

zero and waits for 10 to signal it (by setting this wire to 1) b^ck to the previously described wait state until it sees the 

that there is output data to be processed. 40 mg bit flag also set to 1 (indicating that MG has finished its 

When MG sees tiiat its request wire has been turned on, ^i^o^). Once this occurs, 10 calls an interrupt (if this node is 

it knows to start receiving output data indexed by the counter ^ot of the tree) or just requests to send (if this node has 

into its local memory. Once this has been accomplished, MG another node for a parent), gives MG permission to write 00 

can start the merging algorithm. The merge is done from the ^us and then passes all data sent from MG to the parent, 

local memory directiy into the merge buffer (C2 must have 45 ^ote that the proposed scheme allows for unequal execu- 

the current number of coincidences when this step is tion time among the CPs— the next CP to get data is the one 

finished). When this step is completed, MG retrieves the ^Qst recently finished with ite last allowance of daU. Thus, 

current value of LaslOuL If it is greater than CI, then Mg ^ven though the overall operation of the system is clocked, 

knows it can increment CI and move directly on to the next ^^^^ ^ degree of asynchronous processing ability. 

oulputsection.lf CI and LastOut are equal, UienMG sets its 5Q choices for particular processors, buses and otiier 

requestwiretozero.If CI has reached A* J, then MG knows components are open to the discretion of designers, 

that all die results have been computed and merged (and fabircatois, manufacturers, sellers, buyers and users, and the 

thus, tiiat all CPs and 10 are idle) and that it should set its Tinges of options are known to hose skilled in die art: In 

bit flag to one (indicating that it is finished) and start sending particular, all parts of the embodiment described above may 

tiie contenu of the merge buffer back to 10 for transmission 55 5^ obtained from "off-the-shelf* sources, or may be spt- 

to this node's parent. The results are sent simply as the value ^ihHy designed at the VLSI level by persons skilled in the 

of C2 followed by the list of coincidences stored in the ^ai. 

merge buffer (the form of the coincidences is identical to that Different Applications 

described in section 5 above). General 

7. Input/Output Processor 10 ^ Special-purpose embodiments are also possible. For 
10 contains: example, in an application to marketing and analysis of 

a bi t vector of size J sales/transactions data, the objects input to the methods of 
'•"a counter, Tjl, mdicatmg the next available output bin the present invention can correspond to transactions, and the 
a counter, C2, indicating the next unused R*N portion of attributes correspond to instances of sale of particular prod- 
input 65 ucts or services. 
10 is intended to govern the execution of the algorithm as a In an application to the process management, industrial 
whole as it is responsible for the bus arbitration scheme engineering or computer systems manageinent, the objects 
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can corre^ond to particular time slices or time periods, and might correspond to the presence, absence, or levels of 
the attributes correspond to the on/off or used/unused status particular geometric, chemical, physical, biological, 
of particular components, resources, or subsystems. The toxico logical, therapeutic and/or other properties and 
goal of the application could be to find k-ary conflicts or features, e.g., particular chemical moieties. The present 
conflicting demands among interacting subsystems or users, 5 method would be used to find correlations among k-tuples of 
in order to improve the efficiency or lower the costs of the such properties, and this information can be useful in the 
operations. design and testing of compounds and drugs, and in the 

For example, the methods can be adapted to control a design of combinatorial libraries for screening and testing, 
process for production of a product as shown in the general or for other processes or steps in drug discovery and drug 
flow diagram of FIG. 8 and the schematic diagram of FIG. lO design. Alternatively, the above mapping can be transposed, 
9. This example can represent an automated sheet metal so that the objects correspond to the properties and features, 
assembly plant. The methods could be apphed to existing and the attributes correspond to the compounds and drugs, 
data set in order to discover correlation that indicate demand In this way, the present invention can be used to find sets of 
for one of the products from the plant will significantly drugs with similar or complementary or synergistic or 
decrease in the summer months due to cyclical variations, 15 antagonistic activities. This, too, is extremely useful in drug 
while demand for another product increases. A link to discovery and drug design. 

automated process control systems in the plant could reduce In appUcations to demographics, marketing, insurance 
orders for. the first product, while increasing orders for and credit ratings, and/or fundraising, the objects can cor- 
another. Many other examples will be evident to those respond to particular people, or companies, or oiganizations. 
tolled in the art, including variations to the actual structure 20 The attribntes could correspond to the presence or absence 
of the products as a result of discovered correlations. or levels of properties and features relating to employment. 

In an alternate embodiment, the discovered correlations income, wealth, credit history, lifestyle, consumption 
may be used to generate rules for a rules based system that patterns, or social/political opinions or affiliations. The 
in turn produces products based upon those rules. A general present method could be used to discover associations 
flow diagram for such an embodiment is set out in FIG. 10. 25 between such factors, which can be useful in such tasks as 
A corresponding schematic diagram is set out in FIG. 11. predicting credit/insurance risks or detecting fraud; or in 

In a further alternate embodiment, the rules based system determining the best targets for allocating limited marketing 
could be iised to control a process that creates products. A or fundraising resources, for example, 
general flow diagram for such an embodiment is set out in The problem of finding all significant correlations among 
FIG. 12. A conesponding schematic diagram is set out in 30 pairs or k-tuples of attributes in a database is ubiquitous in 
FIG. 13. the computational sciences and in medical, industrial, and 

In application to financial analysis or trading, the objects financial applications. The principles described herein 
can correspond to particular time slices or time periods, and include a probabilistic algorithm that has the interesting 
the variables can relate to particular prices, or price changes, property of finding significant higher-order k-ary 
of particular financial instruments or commodities. By divid- 35 correlations, for all k such that 2^kiN in an N-attribute 
ing the prices of each instrument or commodity into a set of database, for the same computational cost of finding just 
discrete levels, or by using a simple binary code for significant pairwise correlations. Moreover, k need not, be 
"increase vs. decrease", one can represent each such instru- fixed in advance in our procedure, in contrast with other 
ment or commodity by a set of attributes, and the invention known procedures. The procedure was deigned for the task 
can be employed to discover k-tuples of instruments or 40 of finding conserved structural relationships in aligned pro- 
commodities whose price movements are correlated. Those tein sequences, but may have more useful application in 
in the art know of many ways to gain value from sudi other domains. 

discovered information. Application of the Principles Described Herein to Protein 

In applications to medicine, epidemiology, or environ- Sequence Analysis 
mental sdence, the objects can correspond to particular 45 There are interactions between sequence-distant amino 
patients, or to different timed observations of a single / acid residues in the protein chain, sometimes detectable as 
patient, or samples from the same or different environmental / correlations between positions (columns) in a set of aligned 
resource (such as air, soil, or water); the variables and / sequences from a protein structural family, that play an 
derived attributes would correspond to levels^ or the / important role in determining structure and function. Dis- 
presence/absence of particular symptoms, drugs, toxins or/ 50 covered correlations may represent an evolutionary history 
contaminants. In this way, one can use the present invention/ of compensatory mutations, and may provide usefid features 
to discover interactions ^t may cause disease or environ-j in models of protein structural/functional families, but are 
mental hazards.^ — < • • ' 1- ignored or mishandled by most ML (machine learning) 

B'^Rjlectilar and structural biological applications, the\ classification methods, in part because of the high compu- 
objects might correspond to DNA, RNA, or protein Vs tational complexity of searching for k-tuples of correlated 
sequences and/or structures. The attributes might corre^ond ^ positions. 

to the presence of particular bases or amino acids at par- In order to practice the invention on a matrix of biological 
ticular sequence positions, or to substructures with particular sequences such as nucleotide or amino sequences, the dif- 
geometric, chemical, physical, or biological properties at ferent sequences are first optimally aligned for the purpose 
particular sequence or structural positions, or to the presence 60 of comparison. A position in a first sequence is compared 
or absence or levels of other global or local properties. For with a corresponding position in a second sequence. When 
example, set out further below is a detailed application of the the compared positions are occupied by the same nucleotide 
method to protein structure prediction, examples of which or amino acid, as the case may be, the two sequences are 
have previously been described.. identical at that position. The degree of identity between two 

In pharmacological applications, the object might corre- 65 sequences is often expressed as a percentage representing 
spond to molecular structures or other labels or representa- the ratio of the number of matching (identical) positions in 
tions of particular compounds or drugs, and the attributes the two sequences to the total number of positions com- 
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pared. Optimally aligning tow or niore sequences generally 
involves maximizing the degree of sequence identity 
between them. 

Several algorithms and computer programs are known to 
those of ordinary skill in the art for aligning sequences. 
These tools include the PILEUP program from the Genetics 
Computer Group (Madison, Wis.)packagc (version 8) using 
a modified version of the progressive alignment method of 
Feng and Doolitde [J. MoL Evol 25, 351 (1987)]; 
CLUSTAL X, freeware available from the European 
Molecular Biology Laboratory (EMBL), Heildelberg, Ger- 
many; and BLAST, freeware available from the National 
Institutes of Health (NIH), Bethcsda, Md, BLAST-? is used 
for amino acid sequences; BLAST-N is used for nudcotide 
sequences and BLAST X is used for nucleic acid codon/ 
amino acid translation. 

Several kinds of useful information can be obtained from 
protein sequence family analysis. 

First, there is information to be extracted at the level of 
individual sequences, in the form of joint symbol frequen- 
cies. It is well-known that an abnormally high observed 
frequency of a particular siqgle-position pattern (e.g., "G 
occurs at residue number 3 in 98% of these sequences'^ can 
reveal an important physico-chemical constraint on second- 
ary or tertiary structure. This is also true of surprisingly- 
frequent joint symbol occurrences (e.g., "G at position 3, L 
at position 5, and M at position 87 occurs much more often 
than would be predicted by the individual marginal 
frequencies"). Sudi long-distance co-occurrences might be 
especially indicative of tertiary constraints, because the 
designated positions may be nearby each other in the 3D 
structure to which all of the modelled sequences conespond. 
(This detection of "suspicioxis coincidences", as when p(A, 
B)»p(A)p(B), is at the heart of pattern recognition and 
learning, as noted long ago by others). 

Second, there is information to be extracted at the "next 
level up", of statistical relationships between the positions 
(columns in an alignment of homologous sequences). If the 
existence of frequently occurring joint symbol k-tuples can 
be used to infer 3D structural interactions, such an inference 
is even better supported by certain information-theoretic 
relationships between positions (columns) over a set of 
many different joint symbol occurrences. This is because 
such symbolic relationships caii signify evolutionarily con- 
served physical or structural relationships between different 
parts of the protein chain. (See FIG. ISa), The observation 
of high values of mutual information and other correlation 
measures between columns has been used successfully to 
predict 3D structural interactions in RNA and in HIV 
proteins, for example, see C. E. Shannon and W. Weaver The 
Mathematical Theory of Communication The University of 
Illinois Press, 1964. While these previously reported efforts 
have focused on pairwise residue — residue interactions, the 
principles described herein, aim at the detection of k-ary 
interac tions for 2^k^N. 

discovered k-tuples of correlated amino add residues 
cane be used in protein structure prediction and structure 
determination. 

Local predictions can help narrow the seardi for the best 
global structure predictions. 

First, there are distance geometry constraints. Secondary 
structure prediction, and the discovery of k-ary long- 
distance interactions, give evidence for presumed contacts, 
of the form contact(ij) for the ith and jth amino acid residues 
in a protein. Using the kind of distance geometry theory 
developed by others (see for example, T. F. Havel, L. D. 
Kuntz, G. M. Crippen The Theory and Practice of Distance 
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Geometry Bull, of Mathematics Biology v. 45 1983 pp. 
665-720. and K. A Dill, K. M. Feibig, H. S. C3ian Cocp- 
erativity in Protein-Folding Kinetics Proc. Natl. Acad. Sci. 
U.S.A. V. 90 March 1993 pp. 1942^1946), one can derive a 

5 set of inferred contacts. One can also derive sets of inferred 
blocks, contacts that are foibidden by a given set of pre- 
sumed or inferred contacts. Essentially, given a model of a 
polymer chain constrained to exist within a fixed volume, 
the assumption that two particular pieces are brought into 

10 contact implies that some other pieces arc also brought into 
proximity and that still other pieces are moved further apart. 
Indeed, others have concluded that "considerable amounts 
of internal architecture (helices and parallel and anti-parallel 
sheets) are predicted to arise in compact polymers due 

15 simply to steiic restrictions. This appears to account for why 
there is so much internal organization in globular proteins." 

Second, as discussed throughout the previous sections, 
one can infer and exploit empirical relationships between 
local and global configurations. Local stretches of sequence, 

20 or selected non-local pairs of residues, can be found to occur, 
with some high probability, in particular global configura- 
tions. Heuristic rules, in whatever form, can be used to avoid 
large parts of conformation space. This inference of particu- 
lar models of cooperativity in folding is a special case: 

25 knowledge of **rules" such as p(contact(i j)|contact(i+l j-1)) 
>p(contact(i j)) can help significantly. 

For example, FIG. 16 illustrates steps in tertiary structure 
prediction. The methods described throughout this applica- 
tion can be applied as part of a larger tertiary structure 

30 prediction system, wherein the principles described above 
are employed in the block related to' the analysis of aligned 
sequence families. The system predicts the structure of a 
protein. 

Discovery of Evolutionarily-Conseived Structural Con- 

35 straints 

Three questions are addressed in this section: 

1. What kinds of evolutionarily conserved multi-residue 
structural or functional constraints might one expect to 
find by detecting correlations between columns in a 

^ multiple sequence alignment? 

2. Have correlation-detection efforts in face found impor- 
tant structural or functional constraints? 

3. How much information do such discoveries provide 
towards predicting or determining a molecule's native 
tertiary structure? 

AVhat Do We Expect to Observe? 

A protein family is the set of amino acid sequences that 
are believed to share a common global tertiary structure. The 
theory and observation of protein folding and evolution 
supports the general idea of evolution and conservation 
within a protein family: 

Functional constraints are conserved in surface residues; 

Structural constraints are conserved in core residues; 
55 Mutational drift dominates in loop residues; 

Functional constraints often involve other molecules — 
such as other proteins, nucleic acids, lipids, metals, O2 or 
other small molecules. 

The kind of structural constraints expected to be con- 
60 served throughout evolution of a protein family are mainly 
those involving a few key residues that stabilize a confir- 
mation. Where electrostatic interactions are deemed 
important, one might expect to find a conservation of net 
charge across two or more sequence positions. When one of 
65 two electrostatically interacting residues carries a positive 
charge, its ^^partner" residue (presumably close in 3D struc- 
ture even if distant in sequence) should be negatively 
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charged, aad vice versa. The situation is similar for packing 
constraints. One might reasonably expect sections of the 
protein core volume to vary only slightly across the many 
different proteins in the same structural family, while non- 
core regions might display large volume variability. Thus 
one might expect to find pairs or small k-tuples of residues 
that display mutually compensatory mutations with respect 
to side-chain volume — ^when a "Large*' mutates to a 
"Small", another "Smalf must mutate into a "Large", to put 
it simplistically. 
What Has been Observed? 

Neher et al. (How frequent are correlated changes in 
families of protein sequences PNAS, 91:98-102, 1994) 
attempted to quantify the frequency of compensatory 
dianges within a single protein family by using physico- 
diemical property iodices for amino acids and then estimat- 
ing Pearsonian correlations between columns in an align- 
ment. They attempted to get around the small-dataset 
problem with a bootstrap-inspired resampling scheme based 
on the examination of pairs of sequences from the family. 
Their study of the myoglobin family of protein sequences 
found the degree of compensatory mutation to be low for the 
property of side-chain volume but high for electrical 
diarge — close to the correlation level expected for perfect 
conversation of local diatge. The authors speculate that 
because their column-pair analyses focused only on contact- 
neighbour pairs of residues, they were able to detect a very 
locally-acting constraint like charge conservation but not a 
more distributed constraint like conservation of volume. (In 
other words, a single positively-charged residue must be in 
contact with its single negatively -charged structural partner, 
whereas a set of compatible-volume partners may comprise 
more than two residues and need not all be in contact). 
Others have also found some evidence of coordinated muta- 
tion in the evolution of protein structural families. 

While most studies, to date, of compensatory mutation 
focus on highly-conserved "core'^-type regions of protein 
structures, Korber et al. (Covariation of mutations in the V3 
loop of HIV-1: An information-theoretic analysis. Proc. Nat. 
Acad. Sd, 90, 1993) analyzed the highly-variable V3 loop of 
the HIV-1 envelop protein. The researchers performed 
robust bootstrapped estimates of the pairwise mutual infor- 
mation for all column-pairs from a set of 31 columns, 
representing V3 residues. They found a set of about seven 
pais that showed considerable and statistically-significant 
mutual information, and their analysis of the particular 
attributes (amino adds) suggested a particular pattern of 
highly likely compensatory mutations. Although the authors 
did not argue or provide evidence for any particular prop- 
erties or relationships being conserved, subsequent muta- 
tional analysis experiments in the laboratory indicated func- 
tional linkage between some of the pairs of sites with high 
mutual information. Because the V3 region is known to be 
both functionally and immunologically important, the inven- 
tor of the instant application suggested that such analyses 
might be important in the search for HIV/AIDS vaccine 
design. 

What Kind of Method is Needed? 

Dearly, several well-studies and effective methodologies 
exist for the comprehensive modelling of protein sequence 
families. In each case, the mathematical machinery is in 
place to handle and detect very local and low-order statis- 
tical structure in the data. In each case, the dif&culties with 
computational complexity and statistical estimation arise in 
the attempt to accoimt comprehensively for all possible 
non-local and higher-order interactions between residues, 
i.e., columns in the aligned sequence data. 
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Easier progress in modelling can be made if one is to use 
HMMs or density networks in conjunction with a fast, 
heuristic preprocessor that focuses exphcitiy on the detec- 
tion of plausible non-local interactions while sacrifidng a 
degree of predsion in modelling these interactions. Such a 
procedure is provided by the principles described herein. 

a) HIV PROTEIN SEQUENCE ANALYSIS 

Tests on an HIV Protein Database 
jQ The Los Alamos HIV database contains, among other 
things, the amino add sequences for the V3 loop region of 
the HIV envelope proteins. This region is known to have 
functional and immunological significance, and the discov- 
ery of sets of sites linked by evolutionary covariation might 
have important implications for understanding and prevent- 
ing HIV infection and replication. 

An earlier and smaller version of the same database was 
used by Los Alamos scientists in their analysis of pairwise 
2Q mutual information between residues (columns). 

Experiments were performed on an HFV dataset with the 
coincidence detection procedure, over a set of different 
values for r and T. Tables of results are shown and discussed 
below. 

25 Results of Experiments on HIV Protein Database 

The aforementioned version of the HIV-V3 dataset was 
edited in order to focus on the thirty-three residues consid- 
ered most conserved and most structurally and functionally 

^ important by the Los Alamos researchers. The dataset there- 
fore consisted of M»657 rows (sequences) of N»33 columns 
(residues). For the coincidence detection procedure, these 33 
columns are transformed into N^N.\A\"33.21«693 
attributes. As with the artifidal datasets, a set of experiments 

35 with different values of T and r were performed. Coind- 
dence detection runs were done with TolO,000 and ro5, 6, 
7, 10 respectively, and with T«100,000 and r«7, and finally 
with T»750,000 and r-7. The results are shown in tables C.l 
through C.9 below. 

40 

Table C.l: The most likely correlated attributes, as estimated 
by the coincidence detection procedure, for the HIV dataset. 
These results were produced with parameter settings T-10, 
000 and ro5. 
45 HIV Dataset. 

T-IO^, 1*5. 



Rank CSET Observed Ejqpected Frob. 



55 



1 


Qn|D24 


1012 


632.553864 


0.316056 


2 


R17fr21 


901 


610.770465 


0.509734 


3 


R12|Q17 


570 


348.605833 


0.675621 


4 


U3|W19|Q24 


195 


5.535741 


0.750381 


5 


N4lK9|A21 


226 


74.167398 


0.831582 


6 


Vll|R12fn8 


159 


20.764346 


0.858239 


7 


R12|n8 
L13|K31 


454 


318.517747 


0.863429 


8 


419 


300J33903 


0.893461 



Table C2: The most likely correlated attributes, as estimated 
by the coinddencc detection procedturc, for the HIV dataset. 
Tliese results were produced with parameter setting TalO, 
000 and r»6. 
65 HIV Dataset. 

T-10;)00, 1*6. 
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Rank 


CSET 


Obseived 


Expected 


Prob. 


1 


Q171D24 


1177 


385.853329 


0.030891 


2 


R17|r21 


957 


368.736702 


0.146238 


3 


HI2U16 
S1o|D24 


1047 


577.583832 


0.294000 


4 


859 


424.457490 


0350274 


5 


R12|Q17 


656 


224.743830 


0.355855 


5 


Ri2in8 


628 


283.191527 


0.516585 


7 


R17|E24 


563 


234.477161 


0.549033 


8 


H12^17 
A18|l21 


760 


434.274580 


0554644 


9 


560 


315.973734 


0.718330 


10 • 


ni|R17 


861 


627.014684 


0.737741 


11 


L13IW19|Q24 


230 


5.365202 


0.755529 


12 


A21|D24 


619 


405.487239 


0.776262 


13 


N4lK9|A21 


237 


25.176801 


0.779367 


14 


Vll(R12tn8 


220 


15.841474 


0.793296 


15 


U3|K31 


462 


267.211446 


0.809942 


16 


GI0jH12 


324 


157.554658 


0.857348 


17 


M13|W15 


245 


84.760597 


0.867059 


18 


QlT^l 


384 


231.749746 


0.879169 


19 


H12|R17lA18 


147 


8.219536 


0.898526 


20 


N4|K9|H33 


309 


170.353419 


0.898711 



Table C.3: The most likely conelated attributes^ as estimated 
by the coincidence detection procedure, for the HIV dataset. 
These results were produced with parameter settings T->10, 
000 and r-7. 
fflV Dataset. 
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•continued 



Rank 


CSET 


Obseived 


Ejqpected 


Ptob. 


6 


K9|I11 


1230 


311.653160 


0.185125 


7 


A18|H33 


1720 


990.576490 


0.345032 


8 


K9|k33 


1125 


405.874883 


0.355482 


9 


H12|A18 


732 


54.213558 


0.399002 


10 


S10|G23 


1492 


856.152048 


0.445479 


11 


N4|k33 


1257 


689.784961 


0.525468 


12 


A18|Q31 


1188 


636.901303 


0.544755 


13 


Q17ID24 


571 


42938312 


0.S72S2S 


14 


VlljRU 


670 


143.659674 


0.574607 


15 


ni|Ri7 


562 


61.788305 


0.606274 


16 


N4|R17 


992 


498.586806 


0.614520 


17 


R121Q17 


484 


31.204991 


0.663619 


18 


K3lty33 


578 


130.131866 


0.669535 


19 


R17[r21 


479 


39.372545 


0.679400 


20 


S10|D24 


451 


34.199456 


0.706491 



Table C.5: The thirty most likely correlated attributes, as 
^ estimated by the coincidence detection procedure, for the 
HIV dataset. These results were produced with parameter 
settings TolOO,000 and a7. 

HIV Dataset. 

25 

T-IOO^KJO, 1^7. 



T-10,000, 1-7. 



30 Rank CSET 



Obseived 



Frob. 



Rank 


CSET 


Observed 


Expected 


Frob. 




1 

2 
3 


1 


017(024 


1312 


228.829775 


0.008322 




4 


2 


N4|K9 


2023 


996.505631 


0.013558 


35 


5 


3 


H12[A18 


1175 


328.263693 


0.053591 


6 


4 


R17fr21 


940 


216.431391 


0.118015 




7 


5 


Q31[H33 


3198 


2481.050915 


0.122699 




8 


6 


R12fn8 


879 


244.789294 


0.193645 




9 


7 


S10[D24 


836 


232.201517 


0.225612 




10 


8 


R12|Q17 


720 


140.866087 


0.254370 


40 


11 


9 


ni|R17 


808 


360,719364 


0.441944 


12 


10 


H12|R17 


659 


253.717115 


0.511491 




13 


11 


R17|A1I8 


720 


361.819054 


0i;92356 




14 


12 


A21ID24 


554 


236.085429 


0.661974 




15 


13 


R17|E24 


452 


13a843412 


0.670137 




16 


14 


L13[K31 


537 


231.137972 


0.682602 


45 


17 


15 


L13|W19|Q24 


292 


5.055474 


0.714573 


18 


16 


Aisfni 

A18|q31|H33 


442 


165.231990 


0.731502 




19 


17 


480 


209.122778 


0.741198 




20 


18 


M13|W15 


355 


88.975694 


0.749122 




21 


19 


K4jK9|H33 


340 


75.556215 


0.751690 




22 


20 


V11|R12 


513 


253.001684 


0.758878 


50 


23 
74 



Table C.4: The most likely correlated attributes, as estimated 
by the coincidence detection procedure, for the HIVE 
dataset. These results were produced with parameter settings 
Tol0,000 and rolO. 55 
HIV Dataset. 



H12|A18 

N4|K9 

Q17ID24 

Q3l|H33 

R17fl71 

R12IQ17 

R12|n8 

S10|D24 

I111R17 



25 Q17|K31 

26 K91A21 

27 P19tD24 

28 Q171A21 

29 G10[H12 

30 H12tE24 



11686 
21853 
11585 
31715 
9355 
7259 
8380 
7666 
8336 
6342 
6363 
7162 
4451 
4673 
5486 
5224 
3519 
4665 
2585 
5967 
3204 
2424 
6209 
4878 
3440 
5614 
3998 
4151 
2661 
3018 



3282636926 
9965.056308 
2288.297747 
24810.509148 
2164.313906 
1408.660868 
2447.892936 
2322015166 
3607.193645 
2360.854285 
2537.171146 
36iai90543 
1388.434119 
1652319901 
2530.016841 
2311379719 

755,562151 
2091.227775 
50.554739 
3574.032278 

889.756945 

117.500168 
4030.321314 
2773.817984 
1450.098718 
3692671816 
2250,071839 
2414.536189 

953.572593 
1458.576938 



0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000001 
0.000001 
0.000009 
0.000109 
0.001550 
0.002543 
0.005941 
0.021747 
0.024130 
0.028256 
0.031348 
0.044291 
0.066951 
0.072672 
0.096592 
0.112364 
0.114017 
0.144077 
0.164117 
0.198651 
0.221632 
0.287354 
0.292077 
0.304245 
0.370622 



T-10^00, 1*10. 



Ranlc CSBT 



Q31|H33 
N4|K9 
S10[F19 
F19iG23 

Riifns 



Obseived 



Expected 



3933 
2898 
2245 
2660 
1155 



883.532458 
251.248235 
907769718 
1588.173503 
142229768 



Frob. 



0.000000 
0.000001 
0.027977 
0.100497 
0.128554 



Table C.6: The first twenty-five of the fifty most likely 
correlated attributes, as estimated by the coincidence detec- 
60 tion procedure, for the HIV dataset. These results were 
produced with parameter settings To750,000 and r-7. Note 
the appearance, at this degree of sampling, of several 
statistically significant higher-order features with k^3. 

HIV Dataset. 



65 



T-750 W 1*7. 
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Rank CSET 



A18Q31|H33 

A21|D24 
H121a18 
H12]R17 

ni|Ri7 

L13|K31 
L13|W19|Q24 

8 M13|W15 

9 N4(K9 

10 N4|K9|H33 

11 Q17|D24 

12 Q31IH33 

13 R12|Q17 

14 Rizfns 

15 R17|A18 

16 R17IE24 

17 R17[Q31 

18 R17fr21 

19 S10|D24 

20 V11|R12 

21 Vll|R12fn8 

22 K3ltY33 

23 N4|A21 

24 Q17|K31 

25 G10|H12 



Obscxved 


Expected 


Piob. 


36019 


15684.208314 


0.000000 


33816 


12392.399254 


0.000000 


45549 


17706.407140 


0.000000 


86025 


24619.776947 


0^)00000 


48257 


19028.783592 


0.000000 


64548 


27053.952336 


0.000000 


39382 


17335.347894 


0.000000 


20184 


379.160544 


0.000000 


23300 


6673.177096 


0.000000 


162152 


74737.922307 


0.000000 


26376 


5666.716129 


0.000000 


86891 


1 1 lO^^J i\}J 


n nnnnnn 
u.uuuuuu 


2331901 


86078.318611 


0.000000 


53740 


10564.956512 


0.000000 


62774 


18359.197022 


0X)0O0O0 


54366 


27136.429076 


ODOOOOO 


33748 


10413.255892 


0.000000 


45065 


26805.242087 


0.000000 


70301 


16232.354294 


0.000000 


57772 


17415.113746 


0.000000 


39546 


18975.126308 


0.000000 


17628 


881.251263 


0.000000 


36346 


20803.634880 


0.000002 


45441. 


30227.409858 


0.000003 


25033 


10875.740384 


0.000018 


20779 


7151.794446 


0.000041 



Rank 



Pairi,j 



MI(c^S) 



Std. Erioi 



10 



15 



20 



25 



Table C.7: Continuation of the fifty most likely correlated 
attributes, as estimated by the coincidence detection 
procedure, for the HIV dataset: csets ranked 26 through 50. 
These results were produced with parameter settings T=750, 
000 and r-7. Note the appearance, at this degree of 
sampling, of several statistically significant higher-order 
features mtb k^3. 
HIV Dataset. 

T-750,000, r-7. 



30 



1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 



12|18 

4|9 

9|21 

23|24 

12|24 

9|24 

19i24 

11|24 

24)26 

9|11 

9123 

4121 

18|21 

4|11 

12|21 

4|24 

21|24 

11123 

11|19 

10|24 

19|23 

5|26 

9|19 

4|23 

24|25 

6|26 

4|26 

6(24 

912 

15|24 

10|12 

9|18 

11|21 

11|12 

4]19 



0.340449 

0.337943 

0319481 

0.315202 

0.314393 

0.313992 

0.305609 

0.297498 

0.290044 

0.289911 

0.285019 

0.284936 

0.278151 

0.277189 

0.273137 

0.262226 

0.260366 

0.260337 

0.249877 

0.248938 

0.242185 

0.239395 

0.238318 

0.23359 

0.222109 

0.220371 

0.220213 

0.218815 

0.214844 

0.213921 

0.2133 

0.21078 

0.210155 

0.209421 

0.20911 



0.037792 

0.0389162 

0J0353829 

0.0337213 

0.0330382 

0.0344732 

0.0335857 

0X3358645 

0.0384839 

0.0344244 

0.0343224 

0.0332236 

0.0404634 

0.0353993 

0i}33385 

0.036189 

0.0338395 

0.0323302 

0.0320634 

0X1325318 

0.032301 

0.0386373 

0.0331283 

0.0302795 

0X3358744 

0.0397722 

0.0333324 

0.0335123 

0.0280984 

0.0301834 

0.0306496 

0X331734 

0.0308121 

0.0294066 

0.0290533 



35 Table C.9: The top seven pairwise inter-column mutual 
information values for the HIV-V3 dataset, as estimated by 
the Los Alamos group. 



Rank CSET 



Observed Escpected 



Piob. 



26 


K91A21 


40098 


27695.038620 


0.000231 


27 


F19ID24 


29121 


16875.538795 


0.000286 


28 


Q17|A21 


29621 


18109.021417 


0.000737 


29 


H12IE24 


22348 


10939.327036 


0.000839 


30 


N4|K9|I11 


15175 


4159.316971 


0.001355 


31 


S4(I9fri2|V18|R21 


10919 


1.718549 


0.001524 


32 


N4|K9|A21 


11233 


623.181959 


0.002185 


33 


N4|Q31|H33 


21868 


11328.342993 


0.002369 


34 


F19|A21 


44400 


34516.144368 


0.004910 


35 


K9|Q31|H33 


16593 


6991.723718 


0.006625 


36 


W19|C324 


16738 


7234.038664 


0.007331 


37 


E1|N12 


10844 


1492.835945 


0.008575 / 


38 


K9|E24 


13847 


4587.312260 


0.009408/ 


39 


K9IR17 


33735 


24568.179150 


0.010326* 


40 


T12|V18 


23076 


14893.617567 


0.026158 


41 


R12|A21 


15497 


7516.155896 


0.03123 ll 


42 


N4|K9|Q31|H33 


8280 


493.681367 


0.0369051 


43 


N4|K9|A18 


11655 


4250.900600 


0.0506181 


44 


S4|T9fn2|V18|R21jY33 


7370 


0.093039 


0.052029 1 


45 


R12|Q17[ri8 


7452 


240.364918 


0.058992 I 


46 


V11|Q17 


14350 


7329.962834 


0.068429 ' 


47 


H12fr21 


23263 


16324.923094 


0.072825 


48 


Q17fy33 


17288 


10374.788061 


0.074203 


49 


L13|W19 


15536 


8921.243955 


0.092437 


50 


S17IH28 


6529 


138,997153 


0.108375 



40 



45 



r 



55 



,60 



Rank 



Pair i, j 



Table C.8: The top thirty-five pairwise inter-column mutual 
information values for the HIV-V3 dataset, as estimated by 
our methodology as described in the main text. 



1 


23(24 


2 


12j24 


3 


12|18 


4 


12|23 


5 


1924 


6 


10124 


7 


10|12 



65 



Tables C.l through C.4 illustrate the most significant csets 
(again measured by our procedure's estimation 
P(Observed\[ndependence) for the Observed number of 
coincidences for each detected coincidence of attributes. As 
one might expect, a clean separation between ''probably 
correlated" and ''probably uncorrelated" does not manifest 
itself at this comparatively low degree of sampling for this 
real-world dataset. Results for r-? and r^lO indicate more 
significant discovered csets than those for ra5 and ra6. At 
these former, higher r values, one sees the emergence of a 
few csets with "Prob" values less than 0.1: (Q@17, D@24,), 
(N@4, K@9), (H@12, A@18), (Q@31, H@23) and (S@10, 
F@19). All of these csets appear among the most significant 
csets reported in the more intensive sampling runs (with 
Tol00,000 and T«750,000), with the notable exception of 
(S@10, F@19). This latter cset is discovered at this low 
legrce of sampling only in the rolO run, and does not appear 
the more intensive sampling runs shown, both of which 
d r-7. 
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Table C.5 displays the results for TolOO,000 and r=7, and 
here it is clear that some separation of signal from noise is 
taking place amongst the set of HOFs, with seventeen 
pairwise and three 3-ary correlations appearing within our 
Prob=0.1 significance level. 5 

At T«750,000, we have more statistically significant 
detection of almost fifty 2-ary, 3-ary and up through 6-ary 
attribute correlations, as shown in Tables C.6 and C.7. 

In order to get a better sense of the possible meanings of 
these results, let us consider these inter- attribute correlations lO 
along with some inter-column correlations in the form of 
pairwise mutual information estimates performed in our own 
analysis and aJso by the Los Alamos group. Table C.8 
displays the highest estimated mutual information values 
amongst all N — N=528 pairs of columns from our 15 
33-column dataset. The estimates were obtained using a 
Bootstrap-like procedure in which 1000 sample data subsets 
of m=300 out of M=657 were drawn and run though the 
standard mutual information calculation. Reported in the 
table are therefore the mean values over the resampling and 20 
the associated standard error values. There is significant 
intersection between the set of column-pairs indicated by the 
top cset values in Tables C.6 and C.7 and those indicated by 
the top mutual information values in Table Hie corre- 
spondence between the two rankings is not perfect, for a few 25 
reasons (besides noise and simple sampling error). First and 
foremost, while the "sui^iciousness'' of a single joint- 
attribute combination certainly contributes to the mutual 
information within the corresponding set of columns the 
behaviour of the other symbols appearing within the col- 30 
umns obviously also can have great effect. Second, we note 
again the observed sensitivity coincidence detection results 
to the choice of r. 

Table C.9 lists the highest statistically significant mutual 
information values as estimated by the Los Alamos group, 35 
We note the overlap between their list and ours, but we 
emphasize again that group's use of an earlier, smaller, and 
perhaps otherwise different database to which we did not 
have access. 

Application of the coincidence detection method of the 40 
invention to biological data such as these aligned HIV 
sequences thus leads to identification of covarying structural 
elements that were previously unrecognized. The statisti- 
cally significant coincidence of particular structural 
elements, such as amino acid residues, likely indicates a 45 
biological role for a motif comprising the covarying 
elements, as structure and function are tightly linked in 
biochemical systems. One sudi example firom the above 
application of the invention is the statistically significant 
coincidence of residues A18, Q31 and H33 in the V3 loop of 50 
HIV envelope protein. These residues are expected to con- 
tribute to a structural motif of the V3 loop that plays a 
biological role in the HIV life cycle. Such new information 
about A18/Q31/H33, which prior to the invention have 
never before beeo grouped together for a particular biologi- 55 
cal role, may be exploited in various ways, as follows. 

A peptide or peptidomimetic mimicking the afore- 
mentioned structural motif of the V3 loop (or another protein 
motif identified by the coincidence detection method) is 
provided by the invention. For the chosen example, the 60 
peptide or peptidomimetic would include spatial coordinates 
of amino acid residues A18/A31/H33, though every atom of 
these amino acids would not necessarily be required. Rather, 
the peptide or peptidomimetic would have such spatial 
coordinates of A18/Q31/H33, as well as topological and 65 
electrostatic attributes, that would make it useful for a 
biobgical function, such as, for example competing with the 
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actual V3 loop of HIV for binding to another biological 
molecule, where such binding of V3 would employ the 
structural motif that is mimicked by the peptide or peptido- 
mimetic. 

Altemativcly, a peptide or peptidomimetic which is 
designed based on covarying k-tuples discovered by the 
coincidence detection method could be used as an antigpn. 
That is, the biological function which the molecule mimics 
is eliciting an immune response in an animal. Similarly, 
vaccines embodying the covaryiog k-tuples described herein 
are also encompassed by the invention. 

Morgan and co-workers (Morgan et al. 1989. In Annual 
Reports in Medicinal Chemistry. Ed.: Vinick, J. J. Academic 
Press, San Diego, Calif., pp. 243-252.) define peptide 
mimctics as "structures which serve as appropriate substi- 
tutes for peptides in interactions with receptors and 
enzymes. The mimetic must possess not only afGnity but 
also ef&cacy and substrate function.** For purposes of this 
disclosure, the terms "peptide mimetic'* and "peptidomi- 
metic** are used interchangeably according to the above 
excerpted definition. That is, a peptidomimetic exhibits 
functk)n(s) of a particular peptide, without restriction of 
structure. Peptidomimetics of the invention, e.g., analogues 
of the structural motif of the V3 loop posited above, may 
include amino acid residues or other chemical moieties 
which provide the desired functional characteristics. 

Hie invention further provides a ligand that interacts with 
a protein having a structural motif identified using the 
coincidence detection method of the invention, as well as a 
pharmaceutical composition including the ligand and a 
pharmaceutically acceptable carrier or exicipient therefor. 
The ligand would include chemical moieties of suitable 
identity and spatially located relative to each other so that 
the moieties interact with corresponding residues or portions 
of the motif. By interacting with the motif, the ligand could 
interfere with function of that region of the protein including 
the motif. 

Thus, the invention provides a pharmaceutical composi- 
tion for interacting with an envelope protein of human 
immunodeficiency virus (HIV), including a ligand having a 
functional group that interacts with the structural motif of 
the V3 loop which has spatial coordinates of residues 
A18/Q31/H33, and a pharmaceutically acceptable carrier or 
exicipient therefor. The ligand may have more than one 
functional group that interacts with the motif, such as, for 
example, a first functional group capable of binding to and 
being present in an effective position in the ligand to bind to 
residue 18, a second functional group capable of binding to 
and being present in an effective position in the ligand to 
bind to residue 31, and a third functional group capable of 
binding to and being present in an effective position in said 
ligand to bind to residue 33. 

The invention further provides a method of designing a 
ligand to interact with a structural motif of an protein, such 
as, for example, envelope protein of human inamunodefi- 
ciency virus (HIV). For example, in the case where the motif 
is the potentially interesting A18/Q31/H33 motif identified 
by the coincidence detection method discussed above, the 
method of designing includes the steps of providing a 
template having spatial coordinates of residues A18, Q31 
and H33 in the V3 loop of HIV envelope proteio, and 
computationally evolving a chemical ligand using an effec- 
tive algorithm with spatial constraints, so that the evolved 
ligand includes at least one effective functional group that 
binds to the motif. The template provided may further 
include topological and/or electrostatic attributes, and the 
effective algorithm include topolbgical and/or electrostatic 
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osnstraints. Similar method steps would be employed for 
other proteins comprising a motif identified by the coinci- 
dence detection method. 

The invention further provides a method of identifying a 
hgand to bind with a structural motif of a protein. The 5 
structural motif is preferably identified by the coincidence 
detection method. For example, in the case where the motif 
is that identified by the coincidence detection method com- 
prising residues Al 8, Q31 and H33 of HIV envelope protein 
discussed above, the method includes the steps of: providing lO 
a template having spatial coordinates of A18, Q31 and H33 
in the V3 loop HIV envelope protein, providing a data base 
containing structure and orientation of molecules, and 
screening the molecules in the data base to determine if they 
contain effective moieties spaced relative to each other so 15 
that the moieties interact with the motif. Tbe data base may 
further contain topological and/or electrostatic attributes of 
the molecules, and the screening step further include deter- 
mining if the moieties are effective in sudi regard for 
interacting with the motif. For example, a molecule 20 
described in the data base may have sudi physical/chemical 
attributes that it incfaides a first moiety that interacts with 
residue 18, a second moiety that interacts with residue 31 
and a third moiety that interacts with residue 33. Similar 
method steps woidd be employed for other proteins com- 25 
prising a structural motif of interest. 

Where a ligand provided by the invention is included in 
a pharmaceutical composition, the pharmaceutical compo- 
sition further includes a pharmaceutically acceptable carrier 
as in known to persons skilled in the art relating to phar- 30 
maceutical compositions. The term "pharmaceutically 
acceptable carrier" as used herein include diluents such as 
saline and aqueous buffer solutions and vehicles of solid, 
hquid or gas phase, as well as carriers such as hposomes 
(Strejan et aL 1984, J, Neuroimmunol 7:27), and dispersing 35 
agents such as glycerol, liquid polyethylene glycols, and the 
like. The pharmaceutical composition may include any of 
the solvents, dispersion media, coatings, stabiUty enhancers, 
antibacterial and antifungal agents (for example, parabens, 
chlorobutanol, phenol, ascorbic acid, thimerosal), isotonic 40 
agents (for example, sodium chloride, sugars, polyalcohols 
such as mannitol) and absorption delaying agents (for 
example, aluminum monostearate and gelatin) which are 
known in tbe art. 

Alternatively, a Ligand provided by the invention, sudi as 45 
a ligand which binds to a biological target, may be employed 
for diagnostic purposes. A diagnostic agent according to tbe 
invention may include a ligand that interacts with a protein 
having a structural motif identified using the coincidence 
detection method, and a detectable label linked to tbe ligand. so 
The detectable label may be any detectable substance known 
in the art, such as, for example, a fluorescent substance or a 
radioactive substance. Alternatively, the label may be an 
enzyme (such as, for example, horseradish peroxidase or 
alkaline phosphatase) which catalyzes a reaction having a 55 
detectable (e.g., colored) product, or the label may be tbe 
substrate for such an enzyme. 

Application of the Principles Described to Drug Discov- 
ery Background 

The mxilti-billion dollar pharmaceutical industry is based 60 
in large part on the design or discovery and refinement of 
small molecules ("Ugands") that interact with larger mol- 
ecules ("targets") and in some way repress, enhance, block, 
accelerate or otherwise modify the structure, function or 
activity of the target. It is the stmcture, function or activity 65 
of the target that is in some way impHcated in some 
mechanism of disease. Tbe target molecule is often an 
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enzyme or protein receptor or nudeic add or some combi- 
nation thereof. There are a great number of possible ligands 
and only some relatively very few of them are developed and 
mariceted as therapeutic compounds that work with or 
against some one or more taigets and thus are effective 
against disease. 

It is therefore of great interest to biotechnology and 
pharmaceutical researchers to be able to consider a huge 
number of potentially useful compounds, but to avoid 
spending too many resources developing therapies based on 
compounds that may turn out not to be useful, safe, effective, 
and economically viable. The methods described herein can 
be used to enhance and accelerate the process of discovering 
good, effective compounds and of distinguishing the prom- 
ising compounds from the unpromising or less promising 
compounds in a public or private collection of molecules or 
their computer database representations. They can be used 
effectively and contribute value in this appUcation in many 
ways, by hewing to understand and infer target structures 
and by finding ligands whose geometric, topological, elec- 
trostatic or other features make them likely candidates for 
effective interaction with the targets. 

Application of the Principles Described Herein to Data- 
bases of Molecules and their Features 

One way to represent a large number of molecular struc- 
tures within a computer database (whether stored in main 
memory, on magnetic disk, tape, or other electronic or 
optical media) is in terms of "screens". Persons skilled in the 
art will recognize screens as binary attributes wherein a 
given screen, or attribute, represents the presence or absence 
of a particular substructure pattern, for example, sulfate 
group. If a set of compounds is represented with screens, 
then a particular compound, which we will denote by C, can 
be represented by a string of Is and Os wherein the Is stand 
for those pre-defined substmcture patterns that C contains 
and the Os stand for those of the pre-defined substructure 
patterns that C does not contain. 

This scheme can be extended to the representation of the 
primary structure of a nucleic acid or protein in terms of 
attributes, as discussed elsewhere herein. The primary struc- 
ture is also known as the "sequence", that is, a sequence of 
bases, or nucleotides, in DNA or RNA, and a sequence of 
amino acids, also called amino acid residues, in a protein. It 
is simple to represent a protein sequence, for example, as a 
sequence of symbols, each symbol being a letter of the 
alphabet corresponding to one of the twenty standard 
naturally-occurring amino adds. It is also simple to trans- 
form this representation by representing each residue, or 
position, in the sequence by a set of twenty binary attributes, 
if such a representation is desired. The attributes act like the 
screens described above. For example, if the first amino add 
in protein P is an alanine, represented by A, it can also be 
represented by a value of "1" in tbe attribute that stands for 
the question, ''Is the amino add in position 1 an alanme?", 
and by values of ''0" for the attributes representing "Is the 
amino acid in position 1 a cysteine?", "Is it a 
phenylalanine?", and so on. FIG. 15a provides an ilhistra- 
don of amino acid and residues positions. 

It is also easy and sensible to represent other aspects or 
features of the compounds in terms of attributes. For 
example, a given compound C may be known to be active 
against a particular target T, in which case an attribute 
corresponding to the question "Active against T?" would 
have value 1 for the object corresponding to compound C. 
For another example, a pharmaceutical company may have 
run a number of compounds through a set of "assays", or 
tests of biological or chemical activity. An assay might test 
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for some aspect of effectiveness against a target, or for 
ability to cross the blood-brain barrier, or for toxicity, for 
example. Assay results can be represented in terms of 
discrete-valued, and even binary attributes as well, via 
preprocessing routines known to persons in the art. Other 5 
features of particular compounds can include literature cita- 
tions (that is, references to papers or studies in which the 
compound was described, designed, discovered or 
analyzed), and ownership or patent status of the compound. 

Not only can small therapeutic compounds be represented 
in terms of screens and other attributes, but so can larger 
potentially therapeutic molecules such as DNA, RNA, 
peptides, proteins, carbohydrates and lipids. Target mol- 
ecules can also be r^resented in this way. All that is 
required is a predefined (though possibly updated, changing, 15 
shrinking or growing) list of substructural patterns or other 
features deemed important by the researchers or users. For 
target structures, one might want to represent substructural 
patterns as well as their 1-dimensional linear structures 
("sequence'^ genetic linkage information, interactions with 20 
other proteins in disease pathways, literature citations, and 
so on. Sometimes a particular molecule might be listed as 
more than one object in a database, the different objects 
representing different conformations that the molecule can 
take. 25 

Clearly, this use of screens and other attributes in repre- 
senting compound databases can also be represented in 
terms of the M by N data matrix we have used to describe 
the working of the invention. The M by N data matrix is 
illustrated below in Table 1. 30 

Hie rows in Table 1 correspond to a set of molecules, 
compounds, molecular structures or sequences, while the 
columns correspond to features that may include substruc- 
tural patterns, assay results or other aspects of the molecules. 
The value in table cell[i,j] is one (1) if molecule i has feature 35 
j and is zero (0) otherwise. 



TABLE 1 





Featuie 1 


Feature 2 


Feature N 


Molecule 1 


1 


0 


1 


Molecole 2 


0 


1 


0 


Molecule M 


0 


0 


1 


. Steps involved in atmlvinp the 


methnHs described herein to 



t he analysis of a molecular data base include: 

1. Obtain m oicciilar database that supports discret e 
Attribute representation for the ID, 2D and/or 3D 

molecular structures ot interest (or, obtain molecula r 50 
database an d use sta ndard methods to produce such a 
representanonj; klS6 us e standard methods to transform 
sequence and dtner intormation about molecules o f 
^i nterest mto attribute represcntatiops . 

2. J ^esegt this database^ in whole or part, to an embodi - 55 
S ^t of the current invention such that each compoun d 

i n me datab ase corresponds to one or more of the M 
o Bjects (rows) in me empodiment^s data matrix and jo 
ffilf^ach screen-represented substru cture pattern^o^r - 
responos to an attn puie (columnjrpflEedata matr ix. 60 
Ihe additionai aunbutcs rcpresen fing activity, a^a y 
r esStlS, kilOWn targe ts ag ainst wfijcb the compoui^ jas 
6een used, source or means ot product^nor storage of 
the compound, ownership or patent s tatus of the 
compou nd, and so op, plus 'the subs taiicture pattern 65 
at jyibules together comprise thriSTattributes (coIum s) 
in the data matrix. 
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3. Employ the base method above or one of the other 
embodiments described herein on the data matrix. 

4. Direct the discovered ooirelated k-tuples of attributes 
to: 

A graphical viewer, or 

A rule-generator preprocessor for rule-based system, or 

A report for users, researchers or managers, or a report- 
generation system, or 

Another computer program that performs some kind of 
further analysis of the compounds, sequences, or struc- 
tures represented in the database, or 

Another computer program that performs some transfor- 
mation or optimization on the database, or 

Another computer program that directs humans and/or 
robots in drug screening experiments or in design, 
refinement or production of therapeutic compounds. 

The output of the current invention, in this drug discovery 
application, can be useful in many possible ways. 

First, it can be used in setting up or optimizing a screen- 
based representation of molecules. For example, it is known 
in the art that a good screen-based representation should use 
a set of screens (attributes) that are mutually uncorrelated 
and roughly equiprobable: The method of the current inven- 
tion would produce, when used as described above, sets of 
correlated screens; this information can be used to add, 
remove, or combine the features that the screens represent, 
in order to make the modified set of screens closer to the 
ideal of imcorrelated and equiprobable. 

Other useful and valuable aspects of the information 
produced by the method include the following. 

For example, it is not uncommon for a pharmaceutical 
company to have good "lead compounds" that work in in 
vivo or in vitro experiments even when the researchers do 
not know the target structure, the active site on the target 
structure, or even which of several proteins in the biological 
system is the target If the methods described herein are used 
to discover correlations among substructural patterns and 
assay results, this information can aid in inferring a target 
structure and designing even more effective lead 
compounds, because it allows researchers to associate struc- 
ture with desired activity. 

Another example is that of finding conelated amino acid 
residues in that part of a drug discovery database cone- 
sponding to an aligned set of DNA, RNA or protein 
sequences, as discussed later herein. In this case, some of the 
correlated k-tuples of residues (positions) may correspond to 
evotutionarily conserved structural and functional relation- 
ships. Therefore the principles described herein can in this 
way be used to help predict or solve the structure and 
function of important biological macromolecules, including 
pharmaceutical targets such as receptors and enzymes. 

Another example is to find correlations between 
structural, functional, disease pathway or other aspects of 
one target molecule, Tl, and another target molecule, T2; or 
finding correlations between structural, functional or other 
aspects of a set of potential therapeutic compounds aimed at 
Tl and those of a set of potential therapeutic compounds 
aimed at T2. In either case, this correlation information is 
useful because it allows drug designers to apply knowledge, 
compounds and techniques effective against Tl to the effort 
against T2. 

Another rather different application of the principles 
described herein to drug discovery and medical science is 
obtained by considering the transpose of data matrix 
described above. Instead of compounds as objects (rows) 
and features of the compounds as attributes (columns), 
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cx)nsider what is possible when the compounds correspond 
to columns and their features correspond to rows. Sec Table 
2 below. Use of the current invention in this scenario 
produces correlated k-tuples of compounds in feature-space. 
These produced k-tuples can embody several kinds of valu* 5 
able information. For example, if the features in the rows 
represent mostly substructural patterns (screens), then the 
produced k-tuples correspond to clusters of compounds. 
Such clustering of compound databases is very useful in 
high-throughput screening (HTS), with both biological/ 
chemical assays (in vitro or in vitro) and computational 
assays. In HTS, it is useful and economical to assay only one 
or a few memljers of each cluster of compounds initially; 
then, only in the cases where a "hit" occurs (that is, a 
compoimd "passes" the "test" in the assay of biological or 15 
chemical activity) do other members of the corresponding 
cluster get sent through the assay. 

Use of the method on the "transpose" of the molecular 
database shown earlier, in order to cluster the compounds in 
feature-space is shown in Table 2. It is now the columns that 20 
• correspond to a set of molecules, compounds, molecular 
structures or sequences, while the rows correspond to fea- 
tures that may include substructural patterns, assay results or 
other aspects of the molecules. There are M' rows and N' 
columns, where perhaps M'»N and N*oM, for the original M 25 
and N described above. The value in table cell[i,j] is one (1) . 
if molecule i has feature j and is zero (0) otherwise. 

TABLE 2 

Molecule 1 Molecule 2 ... Molecule N* 

Feature 1 1 0 ... 1 

Feature 2 0 1 ... 0 

Featuie M' 0 0 ... 1 



Application of the Principles Described Herein to Dis- 
cover and Analyze Genetic Networks 

Advanced molecular biological and computational tech- 
niques applied in large-scale genome mapping and sequenc- 40 
ing efforts are beginning to give us access to the sequences 
of complete genomes, the complete expression patterns of 
genes, and the ability to store and manipulate diis informa- 
tion. Such information can be used to accelerate the discov- 
eiy of new disease targets and successful therapeutic com- 45 
pounds. It is known that the genes that form the "blueprint" 
for particular physical traits and systems within an organism 
often act together in complex ways. Genes interact in 
mutually regulatory ways, promoting, repressing and other- 
wise modulating their own and each others' activation and 50 
expression. 

Traditionally, molecular biology has focused on the study 
of individual genes in isolation. However, to understand 
complex biological phenomena like neural development or 
oncogenesis, for example, it is necessary to study the 55 
expression patterns of tens or hundreds of genes in parallel, 
taking into account temporal patterns as well as anatomical 
patterns. Sudi analysis requires novel computational and 
statistical capabilities,, such as those provided by the prin- 
ciples described herein. so 

While many variations are possible and can be envisioned 
by those in the art, a basic scheme for employing the 
methods described herein in the analysis of genetic networks 
might include the following steps: 

Step 1: Select the genes of interest. 65 

Step 2: Select the biological parameters by which to 
represent the status of a gene at a particular time. 
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Biological parameters can include: expression of a gene 
(concentration levels of the associated mRNA or pro- 
tein product, a particular status of a protein such as a 
biologically relevant phosphorylation or any other 
post-translational modification, the location of a given 
protein, or the presence or absence of a cofactor. For 
example, one can use polymerase chain reaction (PGR) 
techniques to amplify, then use known methods to 
detect mRNA levels for each gene, then normalize 
these by dividing by maximum expression levels for 
each gene, and then quantize these continuously vary- 
ing levels into a set of z discrete levels that can be 
represented in the data matrix format described 
throughout this document It is also possible to use 
concentration levek of protein products as indicators of 
gene activity and interactivity. The change, over timed 
observations, of concentrations of proteins is govemed 
mainly by three processes: direct regulation of protein 
syndiesis from a given gene by the protein products of 
other genes (including auto-regulation as a special 
case); transport of molecules between cell nuclei; and 
decay of protein concentrations. 
Step 3: Select a scheme for time-sampling the biological 
parameters of the genes in the genetic system under 
analysis. At each appropriate time, use methods known 
in the art to measure the selected biological parameters 
for the selected genes. 
Step 4: Represent the selected genes in terms of the 
selected biological parameters, and represent the mea- 
sured values of the biological parameters as attributes 
in the data matrix. Represent the time-samples (the 
instances of measurement of the biological parameters) 
as rows in the data matrix. That is, for a cell in the data 
matrix, in the ith row and jth column, enter the quantity 
or feature measured in the ith time-sample for the jth 
biological paramter (which may correspond to the jth 
gene, or it may not, depending upon whether one or 
more parameters are measured for each gene). The 
recorded quantity, level or feature may be binary (e.g., 
the gene is "on" or "off"), or may be one of z discrete 
values. As described elsewhere in this document, any 
discrete-valued attribute can be represented by a binary 
encoding of whether that value is absent or present in 
a given object, so that any of the preferred embodi- 
ments of the current invention can be applied to data of 
this type. 

Step 5: Employ the base method described above or one 
of the other embodiments described herein on the data 
mauix. 

The output of the above steps, that is, a set of k-tuples of 
correlated attributes, can be interpreted as a set of cliques of 
correlated genes. For example, one might discover that one 
gene is "on'* whenever another gene is "on". Or one might 
discover that when one gene Gl is in ''low expression", 
another gene 02 is ''off"; when Gl is in "medium 
expression", 02 is in "low expression"; and when 01 is in 
"high expression", then 02 is in "medium expression". Such 
a result might lend support to the hypothesis that Gl 
promotes die expression of G2, or that "Gl turns 02 on". 
Similarly, correlated k-tuples of genes or biological param- 
eters might provide evidence that one gene represses, or 
"turns off" another gene or set of genes, and so on. All such 
information can be useful in building a model, for example 
a "boolean network", of a set of interacting genes. Such 
models are known to those in the art as providing valuable 
assistance in diagnosing, preventing and curing disease and 
in designing effective and economically valuable therapeu- 
tics. 
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The rows in Table 3 correspond to a set of time-samples 
(a.k.a., time points, time-slices), that is, times or periods of 
observance of the activity of a particular gene or gene 
product. The columns correspond to particular genes or gene 
products. The value in table cell[ij] is one (1) if gene i is 5 
considered "on'*, that is, e.g., "active" or "expressed", during 
time j and is zero (0) otherwise. This representation and 
application is easily extended to situations in which the 
simple on/off status of a gene is replaced by a set of z distinct 
levels of expression, for example, as measured by observed 
quantities of a gene's main protein product. It is also easily 
extended to situations in which more than one biological 
parameter is used to represent the status of a single gene. 

TABLE 3 15 





Genel 


Gene 2 


Gene N 


Hmc 1 


1 


0 


1 




0 


1 


0 


Hme M 


0 


0' ! 


1 



20 



The methods described herein have been applied to a set 
of gene expression data for genes involved in the develop- 
ment of spinal cord in rats, as described in (G. S. Michaels, 25 
D. B. Can, M. Askenazi, S. Fuifaman, X. Wen, and R. 
Somogyi, Pacific Symposium on Biocomputing 3:42-53, 
1988). The dataset is available from those authors and as of 
March, 1988 is also available over the world-wide web 
(WWW) at http://rsb.info.nih.gov/mol-physiol/PNAS/ 30 
GEMtable.htnil. 

Using a reverse-traoscriptase polymerase chain reaction 
(RT-PCR) protocol, the expression of 112 genes (mRNA 
levels, normalized by maximal expression . level) was 
assayed over nine developmental time points (Ell, E13, 35 
E15, E18, E21, PO, P7, P14, and P90 or adult, wherein 
E«embryonic, and Papostnatal). Included in the list of genes 
used are genes considered important in CNS (Central Ner- 
vous System) development covering nine major gene fami- 
lies. 40 

The dataset mentioned above was easily transformed into 
a data matrix of objects and attributes, convenient for 
analysis with the methods described herein, in a few steps: 

1. The real-valued (that is, continuously-valued) gene 
expression levels were transformed into a set of dis- 45 
Crete values by use of a Bayesian clustering method as 
embodied in the SNOB software, described in (C. S. 
Wallace and D. L. Dowe, "Intrinsic Classification by 
MML — the SNOB program". Proceedings of the Sev- 
enth Australian Joint Conference on Artificial 50 
Intelligence, pp. 37-44, 1994). Bayesian methods of 
quantizing or discretizing real munbcrs are well known 

to persons skilled in the art. For convenience of inter- 
preting output, these six discrete numerical values were 
then further transformed into a small set of alphabetic 55 
symbols, A through F. 

2. A data matrix was set up such that the columns of the 
matrix correspond to the 112 different genes and such 
that the rows of the matrix correspond to the nine 
different developmental time points. 60 

The methods described herein were then run on the trans- 
formed gene dataset input, several times, each time using a 
different combination of values for the parameters r (sample 
size) and T (number of sampling iterations). The method can 
be applied to this dataset by use of a computer program very 65 
similar to the embodiment described in Appendices A and D; 
however, that particular embodiment was tailored for appli- 
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catbn to the protein sequence analysis domain, meaning that 
some of the parameter vahies were fixed to be appropriate 
for those particular trials on the HIV protein data. The 
program must be modified to allow for parameter values 
appropriate to the input data. 

These runs on the gene expression data were performed 
on an IBM PC-compatible computer under the Windows '95 
operating system. For each run, a table of results was printed 
out for viewing and analysis. The results of one run, for 
T-100,000 and r-5, is attached as Appendix E. A researcher 
may wish to only print out the top 10, or 50, or 1000 (or any 
other number) most highly correlated k-tuples of genes.' In 
Appendix E, the top 25 are shown. 

In the attached results printout, the following format 
convention was used: 
Each group of one or more lines r^orts one correlated 
k-tuple of genes, that is, one cset (coincidence set) 
which displayed a low probability of its individual 
component attributes being statistically independent, as 
described elsewhere in this document. Low probability 
of independence is a form of high correlation, as known 
to persons skilled in the art and as explained earlier in 
this document. For each k-tuple, the k genes are shown, 
followed by a numerical value for their probability of 
independence. (This number often displays as zero, 
because the calculated value is so small, so close to 
zero, that the decimal expansion is truncated to zero). 
Again, low probabiUty value means high degree of 
correlation. For each gene, the symbol in A. . . F is 
shown, representing the quantized level of expression, 
followed by the internal dataset name for the gene, 
followed by the more standard accepted name for the 
gene. 

The correlated k-tuples produced can be compared to the 
results reported by the authors in the aforementioned scien- 
tific paper. Among the analysis methods employed by those 
authors on this gene expression dataset was a pairwise 
mutual information analysis. In such analysis, a particular 
correlation measure, known as mutual information, was 
measured for each pair of the 112 genes, and the results were 
displayed graphically so that groups of genes with mutually 
high mutual information tend to appear close to each other. 
The method described herein is able, as shown by the results 
in Appendix E, to discover not only highly-correlated pairs 
of genes, but also 3-tuples, 4-tuples, and so on. Examination 
of the results in Appendix E and the results of the authors of 
the previously cited scientific paper shows that the two 
different methods tend to corroborate each other but that the 
current method goes farther in finding correlations among 
large numbers of attributes. For example, an examination of 
any line of ou^ut of our results reveals a set of correlated 
genes such that the different pairs of genes in that set are 
usually ako listed as having high pairwise mutual informa- 
tion by the other authors' method. 

It is not always true that a correlated k-tuple of attributes 
implies that all possible pairs, from that k-tuple, are also 
mutually correlated, nor vice versa. Therefore, a method like 
those described herein, that can find pairwise and higher- 
order k-ary correlations, offers advantages over pairwise 
methods which can fail to detect important higher-order 
correlations among genes or among other attributes in other 
applications. 

Application of the Principles Described Herein to the 
Discovery of Categories in Internet/Intranet Document 
Databases for Use in Document Search Engines 

Do'^cument search by topic or keyword implies the exist- 
ence of an efficient search engine and, indeed, much effort 
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has been applied to the development of effective seaidii 
algorithms. This, however, only represents a part of the total 
solution — the problem also requires an effective document 
categorization strategy. Information theory dictates that an 
effective set of categories, or topics, used to organize docu- 
ments should be uncorrelated and roughly equiprobable. 
When these topics occur with widely-varying probabilities, 
the search space of documents will be eidier too broadly or 
too narrov^y divided by some topics. If correlations exist 
between the topics (that is, where knowledge of the exist- 
ence of a topic within a given document implies a greater 
probability that other topics will be found within the docu- 
ment as wen) then the topic set can be reduced in size (by 
removing some of the correlated topics from the categori- 
zation set). The ^equiprobability'' ooncem can be addressed 
by the application of the principles described herein. This 
problem yields readily to statistical techniques, but standard 
statistical techniques usually fail to capture higher-order 
joint probability terms. The "decorrelatioo" problem is 
much more subtle and intractable. A sub-optimal topic set 
forces the search engine to examine more such topics than 
necessary before the results can be returned to the users (and 
may confuse interpretation of the organization of the docu- 
ments themselves). Given that every increment in search 
efficiency allows greater numbers of users to use the system, 



A graphical viewer or printer, or 

A rule-generat or preprocessor for rule-based system, or 

A report tor admmistrators or other users oi tlie coin ^ 

puter database query system, or a report-generatio n 

System, ot_ 



10 



15 



20 



Another comnutetDrogram that performs some kind of 
further analysis of the data, for example, performing 
more in-depth statistical analysis (e.g., multiple 
regression) on the correlated variables, or 
Another computer program that performs some trans- 

^ formation or optimization on th e da tabase. 

Any statistically significant correlation between topics in 
the topic set may indicate an ineffective initial dioice of 
topics. The correlated k-tuples discovered by the method of 
the current invention correspond both to "highly correlated 
topics'* (with respect to the "decorrelated topics" goal) and 
to "highly probable joint topics'* (with respect to the 
"roughly equiprobable topics" goal). A person skilled in the 
art can use the correlations output in this application, as a 
guide to determining which topic(s) found to c0'<)ccur 
should be removed or combined from the topic set. Using 



the output of the application in this way would allow the 
the developers of such sj^tems can not affoid a lack of ^d^j^strator of such a document search engine to increase 

the performance of the system by reducing the number of 



effective categorization of documents. 

Application of the method to optimal or near-optimal 
topic set reduction can also be represented in terms of the M 
by N data matrix we have used to describe the working of 
the invention in other sections of this document. In one 
application-specific embodiment, the rows of the data matrix 
correspond to particular documents in the database, and the 
columns correspond to a proposed topic set that is intended 
to categorize them. (See Table 6). 

The rows in Table 6 correspond to documents in a 
database, while the columns correspond to proposed topics 
used to classify them. The value in table cell[ij] is one (1) 
if document i mentions topic j and is zero (0) otherwise. 

TABLE 6 
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Document M 
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^iT^y^lveH inj^pplying current jnvention to a searc h 
:ra_n£ar <)ptimal topic set with which to classify a set of 
do cuments inc lude: 

1. Obtain an mitial topic set. The field of document sear ch 
is well established and cttective met hodolo^es for th e 
creau on ot such sets are known to ttiose skilled in th e 
art. ~ " ~~ 

2. Create the database u si ng this t opic set and the set of 
docume ntsThat the to pic set categorizes, uiven the 
top^ set, ail one need do is exami ne each document to 
determme wtieiher or not It BdetHions eacn topic ! 

3. Present this oatabasc, in whole or part, such that eac h 
d ocumentjn the database c orresponds to one or more o f ^ 
%e M objects (rows) in ttie emPoaiment's data matrix 
a nd so that eac h proposed topic corresponds lo an 
at tribute (column) oi me oaia matrix ? 

4. Employ the base method above or one of the oth er 
e mbodunetltS deiSCrtbe d herein 6n the data matnx . " 65 

5. pirect the discovered correlated k-tuples ot attributes 



categories to be searched in response to a user's query. The 
enhanced performance of the system would benefit the 
provider of the service in two ways: the response time of the 
system to user's queries woiild decrease and the total 
number of users that can be served would increase. 

Applications of the Principles Described Herein to Inter- 
net and Intranet Search and Storage 

Intemet and intranet search engines can be ranked sub- 
jectively by examining the length of time needed for users 
to find sites or documents of relevance to their query. Any 
improvement to the underlying algorithms that drive the 
search engine's output that allows users to find what they're 
looking for sooner improves the usefulness of that engine, 
allows it to serve more users and makes it more attractive to 
both the communities of users and advertisers (in the case of 
internet search) and users and management (in the case of 
company intranet search). Presented below are two uses of 
the principles described herein that will provide ways to get 
relevant information to users sooner and to better manage 
the storage of documents on intemet or intranet search 
systems. In the descriptions and examples below, the prin- 
ciples discussed apply equally whether one is considering 
the intcrnetAveb and hence individual web pages and 
websites, or intranets, maintained within the information 
systems of a single company or other institution, in which 
case the search is for documents rather than websites per se. 

For the purposes of elucidating this description, assume 
that each page in the set of web pages, or internal intranet 
documents in the set of such documents, known to the search 
engine has already been classified by topic and that the set 
of topics is fixed a priori. The goal is to present the user with 
the normal output of the search engine but to supplement 
that list of links with an additional list of topics known to be 
related to the user's request. 

The rows in Table 7 correspond to a set of web pages, or 
internal intranet documents, while the columns correspond 
to topics. The value in table cel][i,j] is one (1) if web page 
or document i mentions topic j and is zero (0) otherwise. 
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TABLE? 





Tbpic 1 


Topic 2 


Tbpic N 


Pagel 


1 


0 


1 


Page 2 


0 


1 


0 


PageM 


0 


0 


1 



Table 7 illustrates the database upon which the base 
method or other embodiment described herein will be run, in 
the data matrix format for representing objects and attributes 
that have been defined and described elsewhere herein. Note 
that, because of the characteristics of the embodiment 
described herein, the nuimber of pages used in the table need 
not be the entire set of all web pages. The embodiment, when 
run (or employed) on this table will find those topics that are 
frequently found in the same document together. This indi- 
cates that these topics are related in some fashion and, as the 
set of web pages supports their association, they may be of 
interest to the user as well 

The advantages are several The computational expense 
of these embodiments scales linearly with respect to the 
number of columns in the database, in this application, the 
number of columns represents the number of topics associ- 
ated with web pages. As this number is almost certainly very 
large, this characteristic of the method is a real benefit. In 
addition, if the web pages are kept in random order, the 
embodiments can be run on more manageable subsets of the 
entire set of web pages. This allows the job of finding these 
associations to be divided into much smaller jobs which can 
be run, serially or in parallel, during idle times on the server 
where the search engine resides. This method can produce 
novel associations of great width (k) at any point during its 
execution. Many other "association mining" methods only 
find longer k-tuples of associated alUributes at later stages in 
their long execution times. Lastly, as the list of associated 
topics found by this algorithm grows, the pages that select 
the links for these new "joint topics" can be created and 
cached. This would reduce server loads (thus allowing more 
users to access the system). As this also puts bounds on the 
statistical relevance of the findmgs, this information could 
be used to select which new topic indices would be cached 
and which would be re-created as needed. 

Alternative Application of the Principles Described 
Herein to Manage the Storage and Retrieval of Web Pages 
and Documents 

Internet and intranet search engines attempt to order the 
space of web pages or documents by topic. Generally, an 
initial (e.g. a^habetic) ordering is not at all likely to evenly 
divide that space. For example, the topic ''California" will 
have a vastly greater set of pages associated with it than will 
"North Dakota". A simple tree-like storage of the pages by 
topic (with sub-topics at lower levels of the tree) will leave 
"California" with a very deep tree. What would be of use in 
this situation would be some better way to divide the search 
space of pages than by just single topics. In the noted 
example, it would be better to have the large set of 
California-related web pages divided into smaller sets closer 
to the size of the set for North Dakota.' We can keep our 
ordering of the pages by topic if we choose to divide larger 
sets into smaller ones by replacing the single topic describ- 
ing the set with a series of associated topic lists that 
encompasses the same space. Going back to our example, if 
"California" were only strongly associated with "Sunshine", 
"Wine" and "Cars" we would replace the tree node "Cali- 
fornia" with the set of nodes ^'California and Sunshine", 
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"California and Wine", "California and Cars", "California 
and Other". This will allow faster lookup and storage of 
these pages because it reduces the height of this part of the 
tree (in this case) by one. Recursively applying the same 

5 technique at all nodes in the tree would provide a method for 
ensuring better balance than could have been had before. 
The only thing missing from this fonnulation of the new tree 
balancing function is the discovery of the associations 
themselves. An application of embodiments described 
herein to the same table discussed in the previous section 
extracts this information from the set of pages. The method 
tcUs us not only which topics are related but also gives an 
indication of the level of support for each association in the 
database. Once a problematically large topic has been 

j5 identified, the list of associations found by the algorithm that 
includes this topic can be consulted to determine how to 
divide the topic. 

The use of tree-based storage retrieval techniques is 
known to those in the art, and such methods include snch 

2Q variations as B-trees, k-D trees, tries, k-D tries, and gridfiles. 
Hashing schemes can also be used instead of, or in addition 
to, tree-based methods per se. With all such methods, there 
are efficiency gains to be made, in both storage (main 
memory and offline memory) and running time, by taking 

25 advantage of particular distributions of the data in the 
application domain. The embodiments described herein can, 
as shown above and in other ways, be used to obtain a better 
understanding of and exploitation of the distribution of the 
data. 

3Q The advantages include all those listed for the first alter- 
native above with one significant addition — if one is already 
using the method to find lists of sites related to a given query, 
then one is already compiling the exact list of associations 
that is needed here to help balance the search tree. 
35 Application of the Principles Described Herein to Sales 
Analysis, Direct Mail and Related Marketing Activities 

Mariceting executives, within retail sales companies, 
advertising/marketing agencies, magazine, newspaper, 
radio, television, film and internet companies, and non-profit 
and charitable organizations, need to know which kinds of 
people are likely to buy or contribute. In all these and other 
marketing contexts, it is very useful and valuable to be able 
to analyze data both from previous marketing campaigns 
(we'll use the term "mailings", though other campaigns and 
45 promotions are also included) and from previous purchases 
of the relevant good and services, or previous contributions 
to charities (let us refer to all these as "products"). 

It is usefiil for marketing executives, salespeople and 
management to know such things as, for example: 
50 Which products tend to be bought together (by same 
customer, perhaps within same transaction)? 
Which of our previous advertising campaigns or mailings 
produced good response (high sales of a product) and 
which did not? 

55 Which demographic factors oonelated with large total 
spending on our companies products last year?Aie 
25-40 year old females in the Midwest region buying 
our products? 

Such questions can be addressed by the analysis of 
60 databases organized in terms of customers, transactions, 
demographic factors, previous marketing campaigns, and 
sales of particular products. For charitable organizations, the 
basic idea is the same, though instead of "sales" and 
"customers" the application is to "contributioos" and 
65 "donors", for example. The principles described herein can 
be applied success^y to these analysis tasks, wherein one 
of the main current computational challenges is the discov- 
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ery of assodations (correlations) amongst sets of variables 
or attributes in very laige databases. Table 8 illustrates the 
application to the analysis of databases on customer pur- 
chases of products. Table 9 is similar except that it illustrates 
the case wherein not only purchases are recorded in the data, 
but also information on previous marketing campaigns. 
Either of these schemes may be augmented by the inclusion 
of additional columns corresponding to demographic 
attributes of the customers, for example region of residence, 
age group, income group, gender, occupational category, 
and participation in community- or leisure-related activities. 

Hie rows in I^ble 8 correspond to customers (and/or 
potential customers), while the columns correspond to prod- 
ucts (goods or services) that were either purchased (denoted 
by 1) or not purchased (denoted by 0) by particular custom- 
ers. The value in table cell[ij] is one (1) if customer i has 
purchased product j and is zero (0) otherwise. 

TABLE 8 





Product 1 


Product 2 


Product N 


Castomef 1 


1 


0 


1 


Customei 2 


0 


1 


0 


Custonm M 


0 


0 


!.! 1 



The rows in Table 9 correspond to customers (and/or 
potential customers), while the columns correspond to mail- 
ings (or other marketing campaigns) and products (goods or 
services) that were either pxirchased (denoted by 1) or not 
purchased (denoted by 0) by particular customers. For the 
Mailing colimins, the value in table ccll[ij] is one (1) if 
customer i was sent mailing j and is zero (0) otherwise. For 
the product columns, the value in table C6ll[iJ] is one (1) if 
customer i has p\irchased product j and is zero (0) otherwise. 



TABLE 9 





Mailing 1 ... 


Mailing q1 


Product 1 .. 


... Product n2 


Customer 1 


1 


0 


0 


1 


Customer 2 ■ 


1 


1 


0 


0 


Customei M 


0 


1 


1 


0 
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Steps involved in applying the principles described herein 
to a sales/marketing database include: 

1. Obtain sales/marketing database as described above. 
Where necessary, use methods known in the art to 
transform continuous-valued variables into discrete- 
state variables. 

2. Present this database, in whole or part, such that each 
customer in the database corresponds to one or more of 
the M objects (rows) in the embodiment's data matrix 
and so that each product or mailing conesponds to an 
atU-ibutc (column) of the data matrix. Mailing attributes 
(if any) plus product attributes together comprise the N 
attributes (columns) in the data matrix. 

3. Employ the base method above or one of the other 
embodk:ients herein on the data matrix. 

4. Direct the discovered correlated k-tuples of attributes go 
to: 

A graphical viewer or printer, or 
A rule-generator preprocessor for nile-based system, or 
A report for marketing peisonnel, magazine/newspaper 
circulation directors, salespeople, managers or other 
users of the computer database query system, or a 
report-generation system, or 



Another computer program that performs some kind of 
further analysis of the data, for example, performing 
more in-depth statistical analysis (e.g., multiple 
regression) on the correlated variables, or 

Another computer program that performs some transfor- 
mation or optimization on the database. 

The output in this application, can be useful in several 
possible ways. 

For example, the output may include conelated k-tupks 
which comprise sets of products that tend to be bought 
together, either within the same transaction or by the same 
customer across different transactions. Sudi information can 
be used to develop "tie-in" and co-marketing campaigns, 
such as, for example, when buyers of NBA basketball tickets 
are given coupons for discounts on NBA team shirts, bas- 
ketball shoes, and other basketball-related merchandise. 
While it is perhaps not surprising that basketball fans like to 
wear NBA team shirts, the steps described above are capable 
of discovering other associations between products that are 
not so obvious. 

For another example, the output may inchide correlated 
k-tuples which represent particular advertising campaigns 
correlated with particular product purchases. Such informa- 
tion can help marketing executives focus their recourses on 
new marketing campaigns of the type most likely to increase 
sales. 

Use of the Principles Described Herein in Qustering 
Customer Data 

Another rather different application of the principles 
described herein to maiketing practice is obtained by con- 
sidering the transpose of the data matrix described above. 
Instead of customers as objects (rows) and products and 
demographic factors as attributes (columns), consider what 
is possible when the customers correspond to columns and 
the product and demographic variables correspond to rows 
(See Table 10). Use of the principles described herein to this 
scenario produces correlated k-tuples of customers, or cus- 
tomer profiles, in the space of demographic and piuchasing 
pattern features. This is seen to be a form of clustering of the 
customer data, into groups of customers or customer profiles 
that are roughly similar in terms of their buying habits and 
lifestyles. Such clustering can be useful in designating 
special "target groups", to enable more optimal allocation of 
marketing resources. Once this transposition of the data is 
envisioned, the other steps apply entirely analogously to the 
descriptions given above for marketing activities. 

Use of the method on the "transpose" of the marketing 
database shown earlier, in order to cluster the customers is 
shown in Table 10. It is now the columns that correspond to 
a set of customers, while the rows now correspond to 
products purchased and demographic features. There are M* 
rows and N* columns, where perhaps M*-N and N'-M, for 
the original M and N described above. The value in table 
cell[)A] is one (1) if customer i purchased product j or 
possesses demographic feature j and is zero (0) otherwise. 

TABLE 10 
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Customer 1 


Customer 2 


Customei N* 


Prod/Demo 1 


1 


0 


1 


Prod/Demo 2 


0 


1 


0 


Prod/Demo M* 


0 


0 


1 
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ApplicatioD of the Priodples Described Herein to the 
Analysis of Medical, Epidemiological and/or Public Health 
Databases 

Medical scientists and practitioners have long known that 
many human diseases and disorders, physical and mental, 
aie caused by complex interactions among many potential 
contributing factors. Such factors can indude particular 
genetic conditions or abnormalities, exposure to biological 
pathogens, aspects of diet, environment (air, water, noise 
pollution), exposure to hazards in the home or workplace, 
emotional stress, substance abuse and poverty, among oth- 
ers. The true '^causes" of a given condition often remains 
impossible to ascertain, though there is much folklore and 
anecdotal evidence offered in attempts to explain some 
instances. The problem of discovery and prevention of 
health threats is helped in recent times by the ability of 
researchers, insurance company representatives, epidemi- 
ologists and public health officials to compile and analyze 
large amounts of data on real people, healthy and sick, living 
and deceased. As in other applications of computers and 
statistical analysis to databases, one must contend in this 
field with a huge number of variables and the exponential 
complexity of their potential interactions. This kind of 
analysis can be improved greatly by methods that efficiently 
find correlations and associations amongst ten, hundreds, or 
thousands of variables. The principles described herein are 
applicable to such a situation. 

Application to medical databases can also be represented 
in terms of the M by N data matrix we have used in other 
sections of this document. In one appUcation-specific 
embodiment, the rows of the data matrix corre^ond to 
particular patients or subjects in a health study; and the 
columns correspond to factors thought to contribute to a 
given disease or set of diseases. Again, these factors can 
include socioeconomic factors, lifestyle (exercise, diet), 
aspects of the patient's home or worlqjlace environment 
(e.g., exposure to carcinogenic chemicals), past medical 
treatments, and so on (See Table 11), 

The rows in Table 11 correspond to patients or to human 
subjects in a study, while the columns correspond to poten- 
tial disease factors. The value in table cell[i,j] is one (1) if 40 
patient i has experienced or been exposed to factor j and is 
zero (0) otherwise. 

TABLE 11 



20 



25 



30 



35 





Factor 1 


Rictox2 


Factor N 


45 


Patleiitl 


1 


0 


1 




Fadest 2 


0 


1 


0 




Patient M 


0 


0 


1 


« 50 



In some application-specific embodiments, there may be 
not just one disease represented imphcitly, but, instead, a 
number of different diseases, represented as attributes along 

with the factors shown in Table 11 and described above. For 55 data, into groups of patients or patient profiles that are 



more of the M objects (rows) in the embodiment's data 
matrix and so that each potential disease factor corre- 
sponds to an. attribute (column) of the data matrix. 
Additional attributes representing different diseases 
plus the disease factors together comprise the N 
attributes (columns) in the data matrix. 

3. Employ the base method or other embodiments 
described herein on the data matrix. 

4. Direct the discovered oonelated k-tuples of attributes 
to: 

A graphical viewer or printer, or 
A rule-generator preprocessor for rule-based system, or 
A report for doctors, researchers, public health officials, 
managers or other users of the computer database 
query system, or a rq)ort-geoeration system, or 
Another computer program that performs some kind of 
further analysis of the data, for example, performing 
more in-depth statistical analysis (e.g., multiple 
• regression) on the correlated variables, or 
Another computer program that performs some trans- 
formation or optimization on the database. 
The ou^ut of this application, can be useful in several 
possible ways. 

For example, the ou^ut may include correlated k-tuples 
which comprise sets of factors associated with one or more 
disease conditions. Such information, perhaps refined 
through further statistical analysis, can provide break- 
throughs in understand, treating, and preventing those par- 
ticular diseases. 

For another example, the output may include correlated 
k-tuples which comprise sets of factors associated with each 
other, such associations being previously unknown. The 
discovery of associated lifestyle factors, such as particular 
diets and obesity or particular professions and high levels of 
alcohol consumption, can itself be useful in improving 
pubUc health policy and medical practice. 

All such discovered correlations can potentially be of 
great benefit to insurance providers, public or private, as 
they must make their actuarial tables and insurance policies 
reflect accurate predictions of health and life expectancy, for 
example, based on lifestyle, socioeconomic and other fac- 
tors. 

Use of the Principles Described Herein in Clustering 
Patient Data 

Another rather different application of the principles 
described herein to public hedth and insurance policy and 
practice is obtained by considering the transpose of the data 
matrix described above. Instead of patients as objects (rows) 
and potential disease factors as attributes (columns), con- 
sider what is possible when the patients correspond to 
columns and the factors correspond to rows. (See Tkble 12). 
Use of the current invention in this scenario produces 
correlated k-tuples of patients, or patient-profiles, in feature- 
space. This is seen to be a form of clustering of the patient 



example, a particular patient p may have lung cancer but not 
diabetes or heart disease, and so row p would have a 1 in the 
column corresponding to lung cancer and have values of 0 
for the cokunns corresponding to diabetes and heart disease. 

Steps involved in applymg current invention to a medical/ 
epidemlological/lifestyle factors database include: 

1. Obtain database of medical/epidemiological/lifestyle 
factors as described above. Where necessary, use meth- 
ods known in the art to transform continuous-valued 
variables into discrete-state variables. 

2. Present this database, in whole or part, such that each 
patient/subject in the database conesponds to one or 
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roughly similar in terms of their lifestyle factors. Such 
clustering can be useful in designating ^dal "low-risk" of 
"high-risk" types of patients or insurance applicants, to 
enable more optimal allocation of healtii services, outreach 
programs, insurance protection, or other resources. Once 
this transposition of the data is envisioned, the other steps of 
the preceding application to analysis of medical and other 
databases apply entirely analogously to the descriptions 
given above, (See Tabic 12). 

Use of the principles on the "transpose" of the disease 
factors databases shown earlier, in order to cluster the 
patients or policy-holders in factor-space is shown in I^le 
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12. [t is now the cotmnns that correspond to a set of patients, 
medical study subjects, or potential insurance policy- 
holders, while the rows now correspond to potential disease 
factors that may inchide lifestyle factors, socioeconomic 
factors, workplace factors, and so on. There are M' rows and 
N' cohimos, where perhaps M'=N and N'=M, for the original 
M and N described above. The value in table celllj,i] is one 
(1) if patient i possesses or has been exposed to factor j and 
is zero (0) otherwise. 



TABLE 12 





Patient 1 


PatLcot 2 


Patient K 


Ftetoi 1 


1 


0 


1 


F&ctoi2 


0 


1 


0 


Factor M' 


0 


0 


1 



Application of the Principles Described Herein to the 
Discovery of the Causes of Failures in Complex Systems 

Administrators of complex integrated systems such as 
computer netwodcs and factory automation systems have 
been faced with the difficult diagnosis problems these sys- 
tem pose since their inception. Where a series of events in 
the system (perhaps over a protracted period of time) leads 
to a failure of the system as a whole, the diagnosis of the true 
cause of the failure can be an ahnost insurmountable task. 
For example, a network interface card on a gateway com- 
puter that fails intermittently when under high load condi- 
tions may not cause the host computer to crash but may lead 
to errors on other computers that use the card (by proxy) to 
service their network requests. Such a problem would be 
difiScult in the extreme to track down using conventional 
diagnosis techniques. Tools that can present administrators 
with a better analysis of the conditions on the system as a 
whole that lead to the failure would speed the diagnosis and 
correction of the underlying problem. 

We need to define the database upon which the principles 
described herein will be applied. 

Ihe database as a whole can be thought of as a state record 
of a series of components over time. The columns of this 
database, when viewed in the data matrix format used 
throughout this document, represent the series of compo- 
nents; the rows represent discrete points in time. The values 
in the table are intended to be an encoding of eadi compo- 
nent's state (on, off, idle, error, and so on) at the time in 
question. Such logging procedures are well known to those 
skilled in the art. 

Ihe rows in Table 13 correspond to points in time, while 
the columns correspond to individual components in the 
system. The value in table cel][ij] is the encoded state of 
component j at time i. 



TABLE 13 





Component 1 


Component 2 


Component N 


lime 1 


1 


0 


1 


Time 2 


0 


1 


0 


Time M 


0 


0 


1 



Steps involved in applying the method of the current 
invention to analysis of a system operations database 
include: 

1. Create a database of system components and their states 
as described above. The choice of state sets for the 
components in the system will be driven by behaviors 
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of interest to the administrators of the system as well as 
by the components themselves. 

2. Ftesent this database, in whole or part, as a data matrix 
such that each column in the data matrix corresponds to 

5 a component in the system and each row in the data 
matrix corresponds to a point in time in the series. 

3. Employ the base method above or one of the other 
embodiments described herein on the data matrix. 

4. Direct the discovered oonelated k-tuples of attributes 
10 to: 

A graphical viewer or printer, or 
A rule-generator preprocessor for rule-based system, or 
A report for the administrators of the system, or a 
report-generation system, or 

15 Another computer program that performs some kind of 
further analysis of the data, for example, performing 
more in-depth analysis on the correlated variables, or 
The output in this application, can be used to indicate the 
events in the system that are typically seen to co-occur with 

20 a given failure. Given the formulation of the database, we 
need not restrict ourselves to the states of the components in 
the system at the time of the failure — we can expand our 
examination of the failure conditions to any range of points 
in time for which the database has records. This allows the 

25 method to help illuminate subtle causal relationships 
between components that ultimately lead to failure. In the 
simplest case, the output can be used to eliminate some 
components in the system from scrutiny if it is seen that they 
are not correlated with the failure. 

30 Application of the Principles Described Herein to the Analy- 
sis of Complex Systems 

Complex systems define a large family of somewhat 
similar appUcations. For the purpose of this discussion, 
complex systems are defined as systems for which three are 

35 no direct detailed modeling approaches because these sys- 
tems comprise a huge number of interacting individual 
components or parts. Examples would include (but would 
not be limited to) economics, individual human behavior, 
productivity in groups of employees, weather patterns, crime 

40 in a nation, etc. In each of these cases, there are no known 
methods to model the system exactly so variables or sets of 
variables are used to measure the state of these systems 
(examples in the case of economics would be the interest 
rate, slock market values and inflation rates). For the pur- 

45 poses of this description, the events in these complex 
systems take the form: pre-condition, action and post- 
condition. These interactions represent the state of the 
system before the actions were taken, the actions themselves 
and the resulting state of the system at some point after the 

50 implementation of the actions. Put another way, the set of 
previous perturbations of the system and their outcomes are 
used as a history of the system from which to derive 
information about the system's characteristics. 
The kinds of databases of complex systems that can 

55 effectively utilize the principles described herein must meet 
certain restrictions. There must be some set of variables 
(either in common usage or derivable from knowledge in the 
domain) used to measure the state of the given system. These 
variables are used in the pre and post condition parts of each 

60 database entry. Additionally, there must be some general set 
of actions that may be appUed to the system that encompass 
methods by which it is known the system may be perturbed. 
Returning to the economics example, the action set would 
include all things under the heading of "fiscal policy. 

65 Formally, the database must include attributes represent- 
ing zero or more pre-condition variables zero or more action 
variables, and zero or more post-conditions variables. Leav- 
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ing aside the trivial case wherein the database contains zero Consider that case of weather patterns. If we focus on the 
pre and post condition variables and zero action variables, post-condition "tornadoes" (that is, we cull the resulting 
there are eight cases to consider. They will be presented correlation set so that it includes only those correlations that 
exhaustively below with examples w;here appropriate. Note involve the appearance of "tornadoes'* in the post- 
that in each case, there are two inteipretations of relevance. 5 conation), then what these correlations tett ns are precursor 
For example consider the case where we have pre-condition ^^^^ ^^^^^^^ immanent, 
variables and action variables but no post-conditions. The ^ 

correlations can be derived in two ways: the database itself The last case is the most general: the database contains all 
could have had no post-condition variables in it (and the three types of variables. Note that a database of this form is 
returned set of correlations is culled to remove any corrc- lo capable of having correlations of attributes of all the pre- 
lations that involved only variables of one type) or it can be ceding types. Example domains have already been given 
that just the set of correlations themselves contain no (economies, crime in a population, etc.) Here the oorrela- 
post-condition variables even though the database does in ^^j^^ thought of as rating actioiK sets (given some set 
fact contain them. For the purposes of the discussion, we pre-conditions) based on the quality of the post- 
assume the former is the case — we can always cull the 15 ^jQjjjj^tjQjjs 
results of the method on a database that has more types of 

variables to leave a set of conelations whidi do not have The last consideration is the types of data that the database 

some types of variables. entries contain. Binary valued attributes, as noted through- 

If the database contains only variables of one type (i.e. out this document, can readily be accepted by this method, 

only action variables or pre or post condition variables) then 20 Other value types must be of limited range of discriete 

the correlations derived from it can be interpreted in one of values. Where this is not the case (i.e. real-valued or 

two ways. If the variables are pre or post condition variables, integer-valued attributes), some transformation must be per- 

then the results indicate situational archetypes-that is sets ^^^^ ^^j^^ ^^^^ ^^^^ ^^^^ 

of attribute values (or equivalently, states of variables) that ^^^^^ ^ manageable number. Various clustering 

tend to be seen together. An example from the domam of 25 ^. , r j j r j 

weatherpattemswouldbe rain and low barometricpiessure. "^^^f^ ^^^^1 . ' 

If only acUon variables are present in the database then weU-known to those skiUed in the art. 

correlations found between them indicate sets of decisions In all cases, the correlations returned by the method are 

that tend to be made together. In a military domain, we might ideal inputs to a case-based reasoning package. Given a 

discover that flanking maneuvers and offensives tended to be 30 condition of the system (i.e. the current condition), a cased- 

seen co-occurring. As these types of databases are very based reasoning tool could use the associations found by the 

similar to others described elsewhere in this document (as principles described herein as a basis for analysis of possible 

would be the appKcations of the method in these cases), this ^^^^^^ ^^^^^^^ ^ ^ 

section will not exphcitly address them. r h t ti» ct 

Hie cases where the database contains variables of only 35 *PP^«^ ^ ™ syslwn. 

two of the three types are three in number. Generally, the principles described herein can be used as 

Correlations found in a database that contains only pre- a tool to aid decision-makers. Decision-makers can be "real" 

condition and action variables describe the relationship or artificial (that is, the method can be used as part of an 

between situations in the domain and the selection of artificial intelligence engine whose purpose is to make 

actions. An example is football play-caUing (note that this 40 decisions in the domain of interest), 

also involves a complex systena that can no t be modeled in Description of the Application of the Principles Described 

any direct detailed way--the play-caller). Here the correla- HercintoDatabases with Pre-condition Variables and Action 

tions mdicate the tendencies of the action-taking entity, e.g., Variables* 
a coach or quarterback. 

If the database contains only action and post-condition 45 Given the above-noted restrictions on the form of the 

variables, then the correlations found elucidate the effec- database, it is clear that the input requirements for the 

tiveness of sets of actions regardless of pre-conditions. application of the embodiments described elsewhere herein 

Going back again to the football example, correlations of are met. In the convenient data matrix representation cited 

this type would illuminate the ability of the team in question elsewhere in this document, the M rows in this context are 

to p^orm certain actions (e.g., if '%ird and long yardage to so the total selected set of pre-conditions and actions taken. If 

first down'' tended to residt in a poor post-condition set, like the entity that applies the actions can sensibly be personified 

fourth down, then we would know that the team tended to be then these rows can represent a history of the decisions made . 

ineffective in this situation). Another inq>ortant example is by the entity and the states of the system at the time they 

drug interaction. In this case, the actions are the drugs given were made. The N columns comprise the set of state 

and the post-conditions are the side-effects reported for 55 variables that define the state of the system and the set of all 

some patient. applicable action variables that describe the ways in which 

While the utiHty of the case where the database contains ^y^^^^ ^e perturt»ed (see Table 14). 

only pre and post condition variables may be unclear on first j- r^,. ^j, j.-* r 

examination, it may well be that this is one of the most ^he rows of Table 14 corresppnd to mstances of or 

usefiil cases. Here we are either interested in things that tend 60 combmations of system states (the pre-condition of the 

to happen after a situation in the given domain regardless of system) foUowed by acUons taken m response to that state, 

actions taken by the decision-maker or we are in a domain while the columns correspond to variables thought to 

where there are no actions that can be taken (or none that describe the state of the system and possible actions that can 

effect the system itself). An example of the former would be be applied to the system. Hie value in table cell[i,p] is an 

the fact that the pre-condition "third and long" in football 65 encoding of the measure of state variable p in event i if 

tends to be followed by the post-condition "fourth and long" . column p is a pre -condition column and is an encoding of the 

In fact, it may be the latter case that is the most interesting. action taken in event i if column p is an action column. 
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TABLE 14 




Pre 1 


Pre j 


Actl 


Actk 


Row 




. C(lj) 


A(lj+1) 


A(lj+k) 


1 

Row 
2 
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A(2j+2) 
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Row 
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A(aij+2) 


A(inj+k) 



s 



There are some other considerations that must be 
addressed prior to the application of the Principles described 
else^ere herein to any given domain. The set of state 
variables must be defined. This is left to those skilled in the 
domain itself (e.g., football coaches, military analysts, etc.) 

Previously noted examples are the case of football play- 
calling by coaches and military decision made by generals. 
In general, preferred implementations of this invention will 
use the method of the cunent invention on databases of this 20 
form in order to extract information about the action-taking 
entity. The correlated state variables and actions describe the 
tendencies of this entity. As noted above, these may be 
further analyzed using case-based reasoning tools to give a 
better picture of the entity's likely decisions given a state of 25 
the system. 

Another use of the invention on databases of this type is 
in discovering fraud indicators in tax collection. Here we let 
the pre-conditions be a set of attributes intended to capture 
the salient details of a tax return (sudi things as total income, 30 
total tax owing as reported by the individual or business, tax 
exemptions claimed, etc.) and choose the action variables to 
define a set of possible tax evasion methods. The correla- 
tions found by the invention then indicate associations 
between types of tax returns and types of tax evasion. As 35 
coincidence detection bounds the retxirned correlations 
statistically, wc not only find indicators of evasion but also 
the reliabiity of these findings. Given that tax collection 
agencies can not afford to investigate all tax returns sent to 
them, this method allows them to find a well-chosen subset 40 
of these returns that is most likely to result in findings of 
fraud (and greater monetary returns for the government), 

TTie last such use that will be presented as in the domain 
of insurance fraud and is very similar to the application of 
the principles described herein to tax collection. The pre- 45 
condition variables are intended to capture a set of details in 
an insurance claim that are thought to be possible indicators 
of fraud (amount claimed, specifics concerning the insured 
entity, etc.) and the action variables represent types of fraud. 
The results found when the principles described herein are 50 
applied show correlations between the details of insurance 
claims and types of fraud. Insurance companies can not 
investigate .all claims sent to them; so, the applications of the 
principles described herein will narrow the total list of such 
claims to a set more likely to be the subject of fruitful 55 
investigations. 

Steps involved in applying thprinciples described herein 
to a database containing pre-condition and action variables 
include: 

1. Create the database of system states and actions taken 60 
by the action taking entity as described above. Where 
necessary, use methods laiown in the art to transform 
continuous-valued attributes into discrete-state 
attributes. 

2. Present this database, in whole or part, such that each 65 
states/action set corresponds to one of the M objects 
(rows) in a data matrix and so that each state type 
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aspect and action type corresponds to an attribute 
(column) of the data matrix. 

3. Employ the base method or other embodiment 
described herein on the data matrix. 

4. Direct the discovered correlated k-tuples of attributes 
to: 

A graphical viewer or printer, or 

A report for decision-makets, or a report-generation 

system, or 

. Another computer program that will use the correla- 
tions found as a basis for making decisions (for 
example, a case-based reasoning package), or 
Another computer program that performs some trans- 
formation or optimization on the database. 

This application of the princQ)les described herein pro- 
vides and utilizes a list of correlated state/action sets that 
give insight to the inclinations of the action-taking entity. 
Were one to be interested solely in one system state (or in 
only a few aspects of a given state), for example the current 
state, one could cull the results of any correlations that do 
not share a given set of a^ects with that state. The resultant 
set would represent correlations between the aspects of 
interest and the actions taken in response. The resulting 
insight into the action-taking entity's methodology can be 
used in fiirther decision-ma^g. 

DesCTiption of the Principles Described Herein as Applied 
to Databases with Pre-condition Variables and Post- 
condition Variables: 

Here, too, the above-noted restrictions on the form of the 
database force compliance with the input requirements of 
the embodiments described elsewhere herein. The M rows in 
this context are the instances or combinations of pre- 
conditions and post'KX)nditions (viewed together, one can 
think of these rows as being the system's transitions between 
states). The N columns are comprised of the set of state 
variat^les that define the state of the system before and after 
the transition (see Table 15). 

The value in cell[ij] of Table 15 is an encoding of the 
measure of state variable j either before or after the transi- 
tion. 



TABLE 15 
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There are some other considerations that must be 
addressed prior to the application of this invention in any 
given domain. The set of state variables must be defined. 
This is left to those skilled in the domain itself. 

Equally important is the selection of time quanta that 
define the granularity of the transitions. This too is left to 
those skilled in the art to decide based on their own e]q)ert ise 
and the kinds of information they wish to extract. It is 
assumed that some minimum granularity is imposed by 
either the complexity of gathering such data or by the limits 
of the usefulness of such data. Given this, one can then pick 
any multq)le of this minimum granularity to be the time 
between pre and post conditions. At the very least, this 
distance in time should be long enough for the system to 
have changed its state. 

Possible domains of application for this invention include 
economics and fiscal policy, stock maiket prediction, ath- 
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Letic talent scouting and weather piedictioii. Presented below 
are brief descriptions of each in turn to show how these 
problems may be organized to fit the specifications of the 
method of the current invention. 

In the domain of economics and fiscal policy, we propose 
a database of sets of states where the states are a set of 
economic indicators (inflation and interest rates, housing 
starts, GDP and so on). Each row in the database should 
contain two such states (the pre and post condition of the 
system) separates by a fixed amount of lime. The correla- 
tions found in by the method of the current invention then 
give insight into cycles in the economy. 

For stock market prediction, we propose a set of stocks 
(presumably large) which are thought to have influence over 
one another. Again, a fixed period of time is selected for 
transitions. Ihc rows of this database then tell the transition 
of these stocks over the chosen period of time. The output of 
the invention then indicates which sets of stocks "move" in 
a correlated manner over that period of time. 

Athletic talent scouting (e.g., by professional teams prior 
to a draft of young players) would involve an examination 
of the history of such selections. Each row of the data matrix 
would then pertain to an individual player. The pre-condition 
state is a selection of statistics (and any other information 
available about the player) thou^t to be indicative of future 
performance at the professional level. The post-condition 
state would then be some set of variables intended to 
measure that player's success at the professional level. The 
correlations discovered by the invention would help teams 
find the best set of indicators of future success with whidt 
to make their selections. Note that in this case, the pre and 
post conditions need not be of exactly the same form. Hiere 
is no intended restriction on state representations to force 
them to be equivalent. 

Weather predication is a very straightforward application 
of this invention Here the granularity of the selected time 
quantum is based solely on the kind of information the user 
wishes to discover. Put another way, the time quantum 
determines the degree of prediction desired. If we choose a 
single day, then the correlations found by the method will 
help us predict the weather (given a set of values for each of 
the pre-condition variables that describes the current 
weather) a day in advance. If a week (or a month etc.) is the 
chosen quantum, then this is how far into the future the 
predictions will extend. 

In general, preferred embodiments of this invention will 
use the method of the current invention on databases of this 
form in order to extract information about how the current 
state of the system acts as a predictor for a future state. 
Given probabilistically bounded data conelations between 
states of the system, effective predictions can be made about 
the system* s behavior. 

Steps involved in applying current invention to a database 
containing pre-condition and action variables include: 

1. Create the database of transitions between system 
states, wherein a system state is represented by a value 
of a state variable, over the chosen time quantum as 
described above. Where necessary, use methods known 
in the art to transform any continuous-valued state 
variables into discrete-state variables. 

2. Present this database, in whole or part, sudi that each 
state to state transition set corresponds to one of the M 
objects (rows) in the embodiment's data matrix and so 
that each state variable corresponds to an attribute 
(column) of the data matrix. 

3. Employ the base method or other embodiment 
described herein on the data matrix. 
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4. Direct the discovered correlated k-tuples of attributes ' 

to: 

A graphical viewer or printer, or 
A report for decision-makers, or a report-generation 
5 system, or 

Another computer program that will use the correla- 
tions found as a basis for making decisions (for 
example, a case-based reasoning package), or 
Another computer program that performs some trans- 
10 formation or optimization on the database. 

Description of the Application of the Principles Described 
Herein to Databases with Action Variables and Post- 
condition Variables: 
Here, too, the above-noted restrictions on the form of the 
15 database force compliance with the input requirements of 
the embodiments described elsewhere herein. The M rows in 
this context are the total selected set of actions and post- 
conditions. The N columns are comprised of the set of state 
variables that define the state of the system before and after 
20 the transition (See Table 16). 

The rows of Table 16 correspond to observed instances of, 
or hypothetical combinations of, actions applied to the 
system and their resulting system states. The columns cor- 
respond to either possible actions that' can be applied to the 
25 system or are individual state representation variables. If 
column p corresponds to one of the action types in the 
database, the value in table cell[i,p] of Table 16 is an 
encoding of the action taken. If column j is a column used 
to indicate some aspect of a state of the system, then the 
30 value in the table cell[ij] is an encoding of the measure of 
that aspect. 
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As noted in previous examples, decisions that must be 
made prior to the application of the method of the current 
invention to databases of this type include the choice of state 
variables used to store the state of the system at a given point 

45 in time and the choice of time quantum used to temporally 
separate the actions fi-om the post-conditions. These choices 
are left to those skilled in the domain of appHcation. The 
time quantum chosen must, in the most trivial case, be long 
enough for the actions to have had some effect on the state 

50 of the system. 

Possible uses of this invention include such widely vary- 
ing fields as player management in hodcey and the study of 
drug interaction. 
For the purposes of this document, player management in 

55 hockey concerns only the selection of players for the next 
shift on the ice givrai knowledge of the history of these 
players. The action variables in this case are binary values 
indicating whether or not a player is selected for the shift 
while the post-condition variables comprise a set of out- 

60 comes within the domain of hockey (sudi things as the 
relative score in that shift, penalties called, the length of any 
penalties, relative nimiber of shots taken, etc.). By the 
formulation of the problem, it is dear that the discoveries 
produced by the invention indicate correlations between sets 

65 of players chosen and outcomes on the next shift. In situa- 
tions where the opposing players are known a prior, these 
players can be added to the action variables. In this case, we 
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will find correlations between sets of players, both for our 
team and against it, and outcomes. Given this knowledge tbe 
invention is useful as an aid to coadies in selecting players 
most likely to produce beneficial results. 

TTie study of drug interaction is a natural fit for this 
invention. Here we let the action variables be binary values 
indicating whether or not a given patient has been admin- 
istered some drugs or combination of drugs. The post- 
condition variables indicate the list of side effects reported 
by the patient. The results found by the invention then 
indicate statistically bounded correlations between sets of 
drugs given to patients and side efiFects. In this fashion, the 
method of the current invention can be used to determine 
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before and after the transition as well as the encoded actions 
types (see Table 17). 

The rows of Table 17 conespond to instances or combi- 
nations of pre-condition, actions taken and the resulting 
post-conditions. The cohmins correspond to types of actions 
possible in the domain as well as aspects of interest to any 
given situation in the domain (for both pre and post condi- 
tion columns). If column p corresponds to one of the action 
types in the database, the value in cell[i,p] of Table 17 is an 
encoding of the action taken. If column p is a column used 
to specify some aspect of either the pre-condition or the 
post-condition, then the value in table cell[ij] is an encoding 
of the measure of that aspect. 
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contra-indications in the use of drugs but is perhaps best 
suited as a way to select sets of interactions upon which to 
focus further study. 

Steps involved in applying current invention to a database 
containing action and post-condition variables include: 

1. Create the database of transitions between system states 
and actions over the chosen time quantum as described 
above, wherein a system state is represented by a value 
of a state variable and an action is represented by a 
value of an action-type. Where necessary, use methods 
known in the art to transform continuous-valued state 
variables and action types into discrete state variables 
and action types. 

2. Present this database, in whole or part, to an embodi- 
ment of the current invention such that each action 
set/state set pair corresponds to one of the M objects 
(rows) in the embodiment's data matrix and so that 
each slate variable or action type corresponds to an 
attribute (column) of the data matrix. 

3. Employ the base method or other embodiment 
described herein on the data matrix. 

4. Direct the discovered correlated k-tuplcs of attributes 
to: 

A graphical viewer or printer, or 
A report for decision-makers, or a report-generation 
system, or 

Another computer program that will use the correla- 
tions found as a basis for making decisions (for 
example, a case-based reasoning package), or 
Another computer program that performs some trans- 
formation or optimization on the database. 
Description of the implication of the Principles Described 
Herein to Databases with Pre-condition Variables, Action 
Variables and Post-condition Variables: 

Here, too, the above-noted restrictions on the form of the 
database force oomplianoe with tbe input requirements of 
the embodiments described elsewhere herein. Hie M rows in 
this application are the total selected set of pre-conditions, 
actions and post-conditions. Tbe N columns are comprised 
of the set of state variables that define the state of the system 



25 As noted in previous examples, decisions that must be 
made prior to the application of the method of the current 
invention to databases of this type include the choice of state 
variables used to store the state of tbe system at a given point 
in time and tbe choice of time quantimi used to temporally 

30 separate the actions firom the post-conditions. In this case, it 
should be noted that it is not necessary for tbe pre and 
post-conditions to be equivalent (with respect to the choices 
of variables). These choices are left to those skilled in the 
domain of application. The time quantum chosen must, for 

35 example, be long enough for the actions to have had some 
effect on the state of the system. 

Possible used of this invention include economic policy, 
crime-fighting and military strategizing. 

Given some set of variables to define the state of an 

40 economy (interest rates, inflation, GNP and so on) and a set 
of actions taken as part of the governing body's economic 
policy (issuing and buying back government bonds, etc.), we 
create a database of economic events of the form: existing 
economic state, fiscal policy measures taken and economic 

45 state following the policy decisions. The correlations found 
by the method of the current invention give a measure to the 
effectiveness of economic policy decisions, given a state of 
the economy. Such knowledge would be beneficial in decid- 
ing economic policy as it would show historical support (or 

50 the lack thereof) for a given set of decisions. 

In a similar vein, the use of the current invention to aid in 
setting anti-crime policy starts with the creation of a data- 
base of previous states of the community's crime, policy 
measures taken and the resulting state of crime in tbe 

55 community. The state variables could include things like the 
rates for differing types of crimes (breaking and entering, 
auto theft, etc.), differing characteristics of crimes (i.e. 
whether or not handguns were used etc.) and so on. The 
action variables in this case could include such things as 

60 minimum sentencing guidelines for various crimes, "three- 
strike" laws, the adoption of the death penalty, as weE as 
education and mental health fundiiKg. On such a database, 
the invention would find correlations involving existing 
crime states, policy decisions and tbe outcomes of those 

65 decisions. It is proposed that these correlations could prove 
an invaluable aid to those charged with making such deci- 
sions. 
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The concept of the "decision-maker" needs careful con- 
sideration in the domain of military strategy. It may well be 
the case that there is not enough of a "track record" to fill a 
database with enough of a history of any one general's 
decision making. In such a case, preferred implementations 5 
can extend the concept of the decision-maker to include all 
similar decision-makers. As an example, consider a single 
general commanding a tank division. If the general were 
recently promoted, one would be wise to consider all the 
history of all such generals of the same allegiance. To lo 
increase further the granularity of the use of the method, the 
database could be filled with the decisions made by all 
infantry Eeutenants rather than with those of any one lieu- 
tenant. Correlations found would be indicative of the ten- 
dencies of that class of generals given some measure of the 15 
battlefield conditions faced when they made their decisions. 
Equally, one would be in a position to determine which 
battlefield situations they handled poorly because one has 
access to the outcomes of the decision sets. Such knowledge 
could prove vital to selecting an opposing strategy. 20 

Steps involved in an application of the principles 
described hexein to a database containing pre-condition, 
action and post-condition variables include: 

1. Create the database of states and actions covering the 
chosen time quantum as described above. Where 25 
necessary, use methods known in the art to transform 
continuous-valued state variables and action types into 
discrete state variable and action types. 

2. Present this database, in whole or part, such that each 
statc/action/state triple corresponds to one of M objects 
(rows) in a data matrix and so that each state variable 
or action type corresponds to an attribute (column) of 
the data matrix. 

3. Employ the base method or other embodiment 
described herein on the data matrix. 

4. Direct the discovered correlated k-tuples of attributes 
to: 

A graphical viewer or printer, or 

A report for dedsion-makeis, or a report-generation ^ 

system, or 

Another computer program that will use the correla- 
tions fouind as a basis for making decisions (for 
example, a case-based reasoning package), or 
Another computer program that performs some trans- 45 
formation or optimization on the database. 
It will be understood by those skilled in the art that this 
description is made with reference to the preferred embodi- 
ment and that it is possible to make other embodiments 
employing the principles of the invention which faU within 5Q 
its spirit and scope as defined by the claims on the pages 
following Appendices A through E attached hereto, v/hich 
y^pendices form a part of this description. 
What is claimed is: 

1. A coincidence detection method for use with a data set 55 
of objects, said objects having a number of attributes, the 
method comprising the steps of: 

sampling various subsets of the data set for a plurality of 
iterations, each iteration the sampled subset of the data 
set having for each obj ect the same subset of attributes; go 

detecting, and recording counts of, coincidences in eadi 
sampled subset of die data set, a coincidence being the 
co-occurrence of a plurality of attribute values in one or 
more objects in a sampled subset of the data set, where 
the plurality of attribute values is the same for each 65 
occurrence, the detecting and recording counts of coin- 
cidences in each sampled subset of the data set being 



performed before, at the same time or after sampling, 
detecting and recording counts of coincidences in other 

subsets; 

determining an expected count for each coincidence of 
interest that has been detected in the previous step; 

comparing, for each coincidence of interest, the observed 
count of coincidences versus the expected count of 
coincidences, and from this comparison determining a 
measure of correlation for the plurality of attributes for 
the coincidence; and 

reporting a set of k-tuples of correlated attributes, where 
a k-tuple of correlated attributes is a plurality of 
attributes for whidi the measure of correlation is above 
a respective pre-determined threshold. 

2. The coincidence detection method of claim 1, wherein 
the comparison of observed and expected counts is calcu- 
lated using a Chemoff boimd on tail probabilities, 

3. The coincidence detection method of claim 1, wherein 
the counts are recorded by storing a running total of the 
count of each coincidence over all of the sampled subsets. 

4. The method of claim 1, wherein the steps of the method 
are represented by the following pseudo-code: 

0. begin 

1. read (MATRIX); 

2. read (R, T); 

3. compute _first_ord6r__jiiarginals(MArRIX); 

4. csets: ={}; 

5. for iter-1 to T do 

6. sampled_rows:»rsample(R,MArRIX); 

7. attributes: -get _attributes(san^)led jows); 

8. all_coincidences: =find_all __coiacidences 
(attributes); 

9. for coincidence in all_coincidences do 

10. if cset_already_exists(coincidenoe, csets); 

11. then update_cset(coincidence, csets); 

12. else add_jiew_cset(coincidence, csets); 

13. endif 

14. endfor 

15. endfor 

16. for cset in csets do 

17. expected: »compiite_expected__matclL_count(csct); 

18. observed: «get__observed_match_count(cset); 

19. stats: =update_stats(cset, hypoth_test(expected, 
observed)); 

20. endfor 

21. print__final_stats(csets, stats); 

22. end. 

5. The coincidence method of claim 1, further comprising 
the step of representing the objects and attributes in a matrix 
of objects versus attributes prior to sampling the data set, the 
data set being sampled by sampling the matrix. 

6. A method comprising: 
the method of claim 1, and 

the further step of: 
applying rules that are defined by the reported correlated 
attributes. 

7. The method of claim 1, further comprising the steps of 
first creating a database of transitions between system states, 
wherein a system state is represented by a value of a state 
variable, over a chosen time quantum, and presenting the 
database, in whole or part, as a data set such that each state 
to state transition set corresponds to one of the objects and 
so that each state variable corresponds to an attribute. 
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8. The method of claim 1, further comprisiug the steps of 23. The method of claim 18, the method farther compris- 
first creating a database of states and actions covering a ing the step of separating the data set into subsets for 
chosen time quantxm] and presenting the database in whole sampling. 

or part, as a data set such that each state/action/state triple 24. The method of claim 18, wherein more than one 

corresponds to one of the objects and so that each state 5 subset of objects is sampled and the size of the subsets of 

variable or action type coiTcsponds to an attribute. objwts sampled is a constant. ^ 

9. The method of claim 1, wherein at least one of the ^ 25. The method of clami 18, wherem the companson of 
objects corresponds to a biological sample from a subject ^^^^^^ attribute comcidences to said expected countis 
ani at least Je of the attributes corresponds to a biological ^^^.^e^,^^^^^^^^^ 

P^.r^nl^'^°?''J?''^^o .1 * thestepof storing arumiingtolal of countsofeachaltribute 

10. The method of claim 9, wherem at least one of the coincidence detected over aU the subsets sampled, 
attributes corresponds to a phcnotypic aspect, 27. The method of claim 18, wherein at least one of the 

11. The method of claim 9, wherem at least one of the ^^^j^ts corresponds to a biological sample from a subject 
attributes corresponds to expression of a gene. and at least one of the attributes corresponds to a biological 

12. The method of claim 11, wherein the expression of at 15 parameter of a gene or gene product. 

least one gene is measured by mRNA. 28. The method of claim 27, wherein at least one of the 

13. The method of claim 11, wherein the expression of at attributes corresponds to a phenotypic aspect. 

least one gene is measured by protein product. 29. The method of claim 27, wherein at least one of the 

14. The method of claim 9, wherein at least some of the attributes corresponds to expression of a gene. 

objects correspond to biological samples from a single 20 30. The method of claim 29, wherein the expression of at 

subject collected over time and at least one of the attributes least one gene is measured by mRNA. 

corresponds to expression of a gene. 31. The method of claim 29, wherein the expression of at 

15. The method of claim 14, wherein the expression of at least one gene is measured by protein product. 

least one gene is measured by mRNA. ^2. The method of claim 27, wherein at least some of the 

16. The method of claim 14, wherein the expression of at 25 objects correspond to biological samples from a single 
least one gene is measured by protein product. collected over time and at least one of the attributes 

17. The method of daim 1, wherem said pluraUty of corresponds to expression of a gene. 

iterations is a predetermined number of iterations. , ' ^^^^ f> "^l^Jf^ expression of at 

. .A J * *!. jr J * * least one gene is measured by mRNA. 

18 Acomcidence detecUon method for use with a data set 3^ .j^^ ^^^^^^ expression of at 

of objects, each of the objects having at least one attribute, 30 ^^^^ ^ mcasuttd by protein producf. 

the method comprismg the steps of: 35 ^he method of claim 18, wherein at least one of the 

(1) sampling various subsets of the data set for a plurality objects corresponds to a subject and at least some of the 
of iterations, each iteration the sampled subset of the attributes correspond to the subjects* genes or gene expres- 
data set having for each object the same subset of sion patterns and the presence of a particular drug side-effect 
attributes; or side-effects after having been administered a particular 

(2) detecting attribute coincidences based on results of ^ 

sampUng of the data set; The method of claim 18, wherem at least one of the 

7r ... J objects corresponds to a subject and at least some of the 

(3) recording attribute comcidences; and attributes correspond to the subjects* genes or gene expres- 

(4) comparing at least one recorded attribute coincidence 40 sion patterns and the presence of a particular drug side-effect 
count to at least one expected attribute coincidence or side-effects after having been administered a particular 
count, wherein the expected attribute coincidence count combination of drugs. 

is determined for a coincidence that has been detected 37, The method of claim 18, wherein at least one of the 

in the preceding steps. objects corresponds to a subject and at least some of the 

19. The method of claim 18, the method further compris- 45 attributes correspond to the subjects' genes or gene expres- 
ing the step of: sion patterns and the response of the subject to treatment 

(5) reportmg attribute coincidences based on result of said using a particular drug. 

comparison. 38. The method of claim 18, wherein at least one of the 

20. The method of claim 19, the method further compris- objects corresponds to a subject and at least some of the 
ing die step of applying rules that are defined by the reported 50 attributes correspond to the subjects' genes or gene expres- 
attribute coincidences. sion patterns and the response of the subject to treatment 

21. The method of claim 19, wherein step (S) comprises using a particular combination of drugs. 

the step of reporting at least one numerical correlation value 39. The method of claim 18, wherein said plurality of 

and at least one k- tuple of correlated attributes. iterations is a predetermined munber of iterations. 

22. The method of claim 18, wherein the objects and the 55 

attributes of the data set are represented in a matrix. « • « « 
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It is certified that error appears in the above-identified patent and that said Letters Patent is 
hereby corrected as shown below: 



Title page> 

Item [57], ABSTRACT, 

Line 3, replace "of the data set of sampled" with - of the data set are san5)Ied 
Column 4, 

Line 4, replace "(k-/th-order)" with -- (k-lth-order) 
Column 6, 

Line 2, replace "(k-/th-order)" with -- (k-lth-order) 
Column 19, 

Line 13, replace "feature" with - features 

Colunm 23, 

Line 43, replace "attributed" with -- attributes --. 
Column 71, 

After line 52, before "What is claimed is:", insert the attached 34 pages of Appendices. 
Column 72, 

Line 28, replace "4. csets: ={};" with -- 4. csets :={}; 

Line 30, replace "6. sampled_rows:=rsan[q)le(R,MATRIX):" with 

-- 6. sampled_rows :=rsample(R,MATRIX); 

Line 31, replace "7. attributes: =get" with 7. attributes :=get -. 

Line 44, replace "17. expected: ^compute" with - 17. expected :=compute ~. 

Line 45, replace "18. observed: =get" with - 18. observed :=get 

Line 46, replace "19. stats: =update" with - 19. stats :=update --. 
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Column 73. 

Line 44, replace "steps." with step (2). --. 
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